Reading only a few fields from a text file with multiple delimiters - Python

I have a text file with multiple delimiters separating the values. From it I just want to read the pipe-separated values.
The data looks like this, for example:
10|10|10|10|10|10|10|10|10;10:10:10,10,10,10 ... etc
I want to read only up to the 8 pipe-separated values as a dataframe and ignore the values after the ";", ":" and "," delimiters. How do I do that?

It would be a two-step process. First, read the csv with | as the delimiter:
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(
    "10|10|10|10|10|10|10|10|10;10:10:10,10,10,10"
), delimiter='|', header=None)
0 1 2 3 4 5 6 7 8
0 10 10 10 10 10 10 10 10 10;10:10:10,10,10,10
Then update the last column by removing everything from the first [;,:] character onward:
df.iloc[:, -1] = df.iloc[:, -1].str.replace(r'[;,:].*', '', regex=True)
0 1 2 3 4 5 6 7 8
0 10 10 10 10 10 10 10 10 10
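Note that after the replace the last column still holds strings; a small sketch (assuming all cleaned values are numeric) to cast it back:
df.iloc[:, -1] = pd.to_numeric(df.iloc[:, -1])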
If you know the exact character after which everything should be ignored, you can use the comment parameter as follows. Everything after that single character is ignored:
df = pd.read_csv(StringIO(
"10|10|10|10|10|10|10|10|10;10:10:10,10,10,10"
), delimiter='|', header=None, comment=';')
df
0 1 2 3 4 5 6 7 8
0 10 10 10 10 10 10 10 10 10
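Alternatively (a sketch, not from the original answer): pandas accepts a regular expression as separator with the Python engine, so you could treat all four characters as delimiters and keep only the first nine fields with usecols:
df = pd.read_csv(StringIO(
    "10|10|10|10|10|10|10|10|10;10:10:10,10,10,10"
), sep=r'[|;:,]', engine='python', header=None, usecols=range(9))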

This is longer than the other proposed solutions, but possibly faster because it only reads what's needed. It collects the result in a list, but it could be another container type:
df = "10,10,10,10|10|10|10|10|10|10|10|10;10:10:10,10,10,10"
coll = []
start = 0
prevIdx = -1
while True:
try:
idx = df.index("|", start)
if prevIdx >= 0:
n = int(df[prevIdx+1:idx])
if isinstance(n, int): coll.append(n)
start = idx+1
prevIdx = idx
except:
break;
print(coll) # ==> [10, 10, 10, 10, 10, 10, 10]
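For a single line, a plain re.split sketch gives the same result without pandas (assuming, as above, that every non-pipe delimiter also ends a field):
import re

line = "10|10|10|10|10|10|10|10|10;10:10:10,10,10,10"
# split on any of the four delimiters and keep the first nine fields
values = [int(v) for v in re.split(r'[|;:,]', line)[:9]]
print(values)  # ==> [10, 10, 10, 10, 10, 10, 10, 10, 10]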

Related

Convert a series of numbers into one single line of numbers

If I have a series of numbers in a DataFrame with one column, e.g.:
import pandas as pd
data = [4, 5, 6, 7, 8, 9, 10, 11]
pd.DataFrame(data)
Which looks like this (left column = index, right column = data):
0 4
1 5
2 6
3 7
4 8
5 9
6 10
7 11
How do I turn it into one sequence of numbers, i.e. (4 5 6 7 8 9 10 11), in Python or pandas?
I want to put it into an XML file so it looks like this:
<Or>
<numbers>
<example>4 5 6 7 8 9 10 11</example>
</numbers>
</Or>
You can use an f-string, converting the integers to strings and joining them with str.join (here s is the DataFrame's single column):
s = pd.DataFrame(data)[0]  # the single column as a Series
text = f'''<Or>
 <numbers>
  <example>{" ".join(s.astype(str))}</example>
 </numbers>
</Or>'''
Output:
<Or>
<numbers>
<example>4 5 6 7 8 9 10 11</example>
</numbers>
</Or>
Use tolist() to convert the column values to a list, then join them after converting each element with str(...) (this assumes a column named 'Marks'; for the question's frame the column key would be 0):
marks_list = df['Marks'].tolist()
marks_str = ' '.join(str(x) for x in marks_list)
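Putting it together, a minimal end-to-end sketch (assuming the question's single-column frame, so the column key is 0 rather than 'Marks'):
import pandas as pd

df = pd.DataFrame([4, 5, 6, 7, 8, 9, 10, 11])
numbers = ' '.join(str(x) for x in df[0].tolist())
xml = f'''<Or>
 <numbers>
  <example>{numbers}</example>
 </numbers>
</Or>'''
print(xml)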

How to split each column into more columns in a given dataframe

I have over 100 weekly columns, and for each week column I want to proportion the value into days, assigning row-specific values across 7 new columns.
I am new to Python; I know I need while loops and for loops but am not sure how to go about doing this. Can anyone help?
Based on previous advice from this forum, the code below works for Week1; can someone advise me on how to loop over each week for weeks 2, 3, 4 up to the nth week?
import pandas as pd

df = pd.DataFrame({"Week1": [9, 30, 35, 65], "Week2": [20, 10, 25, 55],
                   "Week3": [19, 35, 40, 15], "Week4": [7, 10, 70, 105]})
# define which columns need to be created: one per day of the week
columns_to_fill = ["col" + str(i) for i in range(1, 8)]
# columns_to_fill = [col1, col2, col3, col4, col5, col6, col7]
# now, go through each row of your dataframe
for indx, row in df.iterrows():
    # and for each of the new columns to be filled,
    # check if the day number is smaller or equal to the row's Week1 value;
    # if it is, fill the column with 1, else fill it with 0
    for number, column in enumerate(columns_to_fill):
        if number + 1 <= row["Week1"]:
            df.loc[indx, column] = 1
        else:
            df.loc[indx, column] = 0
    # now check if there is a remainder
    remainder = row["Week1"] - 7
    # while the remainder is greater than 0,
    # we need to keep adding +1 to the columns
    while remainder > 0:
        for number, column in enumerate(columns_to_fill):
            if number + 1 <= remainder:
                df.loc[indx, column] += 1
            else:
                continue
        # update the remainder
        remainder = remainder - 7
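To extend this to all the weekly columns, one option (a sketch, assuming the columns are named Week1..WeekN as above) is to note that every day receives the integer division value // 7, and the first value % 7 days receive one extra unit:
# distribute every week column over 7 day columns
week_cols = [c for c in df.columns if c.startswith("Week")]
for week in week_cols:
    base = df[week] // 7   # whole units every day receives
    extra = df[week] % 7   # number of days receiving one extra unit
    for day in range(1, 8):
        df[f"{week}_col{day}"] = base + (day <= extra).astype(int)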
Here is a vectorized option: first repeat each row 7 times (the number of days in a week), then add an extra index level with set_index holding the day number.
import numpy as np

_df = (
    df.loc[df.index.repeat(7)]
    .set_index(np.array(list(range(1, 8)) * len(df)), append=True)
)
print(_df.head(10))
# Week1 Week2 Week3 Week4
# 0 1 9 20 19 7
# 2 9 20 19 7
# 3 9 20 19 7
# 4 9 20 19 7
# 5 9 20 19 7
# 6 9 20 19 7
# 7 9 20 19 7
# 1 1 30 10 35 10
# 2 30 10 35 10
# 3 30 10 35 10
Now calculate the result of the integer division with //7, then add the remainder where needed using the modulo %, which you can compare with the extra index level created above, since it is the day number.
# integer division
res = _df // 7
# add the remainder where needed
res += (_df%7 >= _df.index.get_level_values(1).to_numpy()[:, None]).astype(int)
print(res)
# Week1 Week2 Week3 Week4
# 0 1 2 3 3 1
# 2 2 3 3 1
# 3 1 3 3 1
# 4 1 3 3 1
# 5 1 3 3 1
# 6 1 3 2 1
# 7 1 2 2 1
# 1 1 5 2 5 2
# 2 5 2 5 2
# 3 4 2 5 2
Finally, reshape and rename columns if wanted.
# reshape the result
res = res.unstack()
# rename the columns if you don't want multiindex
res.columns = [f'{w}_col{i}' for w, i in res.columns]
print(res)
# Week1_col1 Week1_col2 Week1_col3 Week1_col4 Week1_col5 Week1_col6 \
# 0 2 2 1 1 1 1
# 1 5 5 4 4 4 4
# 2 5 5 5 5 5 5
# 3 10 10 9 9 9 9
# Week1_col7 Week2_col1 Week2_col2 Week2_col3 Week2_col4 Week2_col5 \
# 0 1 3 3 3 3 3
# 1 4 2 2 2 1 1
# ...
And you can still join the result to your original dataframe:
res = df.join(res)
If you want to use while/for loops, this iterates over the original data frame. The data frame rows can be of any length and have any number of header elements, and the sub-header can have any number of elements (1D).
# Imports
import numpy as np
import pandas as pd

# Example data frame and sub-header.
df = pd.DataFrame({"Week1": [9, 30, 35, 65], "Week2": [20, 10, 25, 55],
                   "Week3": [19, 35, 40, 15], "Week4": [7, 10, 70, 105]})
subHeader = ['day1', 'day2', 'day3', 'day4', 'day5', 'day6', 'day7']
# Sort data frame and sub-header.
df = df.reindex(sorted(df.columns), axis=1)
subHeader.sort()
# Extract relevant variables.
cols = df.shape[1]
rows = df.shape[0]
subHeadLen = len(subHeader)
mainHeader = list(df.columns)
mainHeadLen = len(mainHeader)
# MultiIndex main header with sub-header.
header = pd.MultiIndex.from_product([mainHeader, subHeader], names=['Week', 'Day'])
# Hold values in a temporary matrix.
mat = np.zeros((rows, mainHeadLen * subHeadLen))
# Iterate over the data frame's weeks. For every value in each row, distribute
# it over the matrix indices by incrementing the day elements one at a time.
# Note that this consumes df: each cell is decremented down to 0.
for col in range(cols):
    for val in range(rows):
        while df.iat[val, col] > 0:
            for subVal in range(subHeadLen):
                if df.iat[val, col] > 0:
                    mat[val][col * subHeadLen + subVal] += 1
                    df.iat[val, col] -= 1
# Final data frame.
df2 = pd.DataFrame(mat, columns=header)
print(df2)

Creating a dataframe from several lists of lists

I need to build a dataframe from 10 lists of lists. I did it manually, but it takes time. What is a better way to do it?
I have tried doing it manually, and it works fine (#1).
I tried the code in (#2) for better performance, but it returns only the last column.
#1
import pandas as pd
import numpy as np

a1T = [[7, 8, 9], [10, 11, 12], [13, 14, 15]]
a2T = [[1, 2, 3], [5, 0, 2], [3, 4, 5]]
print(a1T)
# Output: [[7, 8, 9], [10, 11, 12], [13, 14, 15]]
vis1 = np.array(a1T)
vis_1_1 = vis1.T
tmp2 = np.array(a2T)
tmp_2_1 = tmp2.T
X = np.column_stack([vis_1_1, tmp_2_1])
dataset_all = pd.DataFrame({"Visab1": X[:, 0], "Visab2": X[:, 1], "Visab3": X[:, 2],
                            "Temp1": X[:, 3], "Temp2": X[:, 4], "Temp3": X[:, 5]})
print(dataset_all)
Output: Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
> Actually I have a varying number of columns in the dataframe (500-1500), that's why I need auto-generated column names. The extra index (1, 2, 3) after the names Visab_, Temp_ and so on is constant for every case. See the code below.
For better performance I tried:
#2
n = 3  # This is a varying parameter. It affects the number of columns in the table.
m = 2  # This is constant for every case; here it is 2, because we have "Visab", "Temp".
mlist = ('Visab', 'Temp')
nlist = [range(1, n)]
for j in range(1, n):
    for i in range(1, m):
        col = i + (j - 1) * n
        dataset_all = pd.DataFrame({mlist[j] + str(i): X[:, col]})
I expect output like
Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
but there is no result (only the error "expected an indented block").
Ok, so the number of columns n is the number of sublists in each list, right? You can measure that with len:
len(a1T)
#Output
3
I'll simplify the answer above so you don't need X, and add automatic column-name creation:
my_lists = [a1T, a2T]
my_names = ["Visab", "Temp"]
dfs = []
for one_list, name in zip(my_lists, my_names):
    n_columns = len(one_list)
    col_names = [name + "_" + str(n) for n in range(n_columns)]
    df = pd.DataFrame(one_list).T
    df.columns = col_names
    dfs.append(df)
dataset_all = pd.concat(dfs, axis=1)
#Output
Visab_0 Visab_1 Visab_2 Temp_0 Temp_1 Temp_2
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
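The same idea also fits in a single dict comprehension (a sketch, assuming each sublist becomes one column, as in the expected output):
my_lists = [a1T, a2T]
my_names = ["Visab", "Temp"]
dataset_all = pd.DataFrame(
    {f"{name}{i + 1}": sub
     for name, lst in zip(my_names, my_lists)
     for i, sub in enumerate(lst)}
)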
Now it's much clearer. So you have:
X=np.column_stack([vis_1_1, tmp_2_1])
Let's create a list with the names of the columns:
columns_names = ["Visab1","Visab2","Visab3","Temp1","Temp2","Temp3"]
Now you can directly make a dataframe like this:
dataset_all = pd.DataFrame(X,columns=columns_names)
#Output
Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5

Pandas DataFrame removes too many rows

I have a dataframe with a lot of tweets and I want to remove the duplicates. The tweets are stored in fh1.df['Tweets']. i counts the number of non-duplicates and j the number of duplicates. In the else branch I remove the lines of the duplicates, and in the if I build a new list "tweetChecklist" where I put all the good tweets.
OK, if I do i + j, I get the total number of original tweets, so that's good. But in the else branch, I don't know why, it removes too many rows: the shape of my dataframe is much smaller after the for loop (1/10).
How does the line fh1.df = fh1.df[fh1.df.Tweets != current_tweet] remove too many rows?
tweetChecklist = []
for current_tweet in fh1.df['Tweets']:
    if current_tweet not in tweetChecklist:
        i = i + 1
        tweetChecklist.append(current_tweet)
    else:
        j = j + 1
        fh1.df = fh1.df[fh1.df.Tweets != current_tweet]
fh1.df['Tweets'] = pd.Series(tweetChecklist)
NOTE
Graipher's solution tells you how to generate a unique dataframe. My answer tells you why your current operation removes too many rows (per your question).
END NOTE
When you enter the "else" statement to remove the duplicated tweet you are removing ALL of the rows that have the specified tweet. Let's demonstrate:
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.random.randint(0, 10, (10, 5)), columns=list('ABCDE'))
Which gives:
Out[118]:
A B C D E
0 2 7 0 5 4
1 2 8 8 3 7
2 9 7 4 6 2
3 9 7 7 9 2
4 6 5 7 6 8
5 8 8 7 6 7
6 6 1 4 5 3
7 1 4 7 8 7
8 3 2 5 8 5
9 5 8 9 2 4
In your method (assume you want to remove duplicates from "A" instead of "Tweets") you would end up with the following, i.e. only the rows whose value occurs exactly once survive, because every duplicated value has all of its rows removed, including the first occurrence.
Out[118]:
A B C D E
5 8 8 7 6 7
7 1 4 7 8 7
8 3 2 5 8 5
9 5 8 9 2 4
If you just want to make this unique, implement Graipher's suggestion. If you want to count how many duplicates you have you can do this:
total = df.shape[0]
duplicates = total - df.A.unique().size
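Equivalently, pandas can count this directly with duplicated:
# rows whose A value already appeared earlier in the column
duplicates = df.A.duplicated().sum()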
In pandas there is almost always a better way than iterating over the dataframe with a for loop.
In this case, what you really want is to group equal tweets together and just retain the first one. This can be achieved with pandas.DataFrame.groupby:
import random
import string
import pandas as pd

# some random one-character tweets, so there are many duplicates
df = pd.DataFrame({"Tweets": random.choices(string.ascii_lowercase, k=100),
                   "Data": [random.random() for _ in range(100)]})
df.groupby("Tweets", as_index=False).first()
# Tweets Data
# 0 a 0.327766
# 1 b 0.677697
# 2 c 0.517186
# 3 d 0.925312
# 4 e 0.748902
# 5 f 0.353826
# 6 g 0.991566
# 7 h 0.761849
# 8 i 0.488769
# 9 j 0.501704
# 10 k 0.737816
# 11 l 0.428117
# 12 m 0.650945
# 13 n 0.530866
# 14 o 0.337835
# 15 p 0.567097
# 16 q 0.130282
# 17 r 0.619664
# 18 s 0.365220
# 19 t 0.005407
# 20 u 0.905659
# 21 v 0.495603
# 22 w 0.511894
# 23 x 0.094989
# 24 y 0.089003
# 25 z 0.511532
Even better, there is a function explicitly for that, pandas.DataFrame.drop_duplicates, which is about twice as fast:
df.drop_duplicates(subset="Tweets", keep="first")
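Note that keep="first" is the default, and drop_duplicates returns a new dataframe, so assign the result back (or pass inplace=True):
df = df.drop_duplicates(subset="Tweets")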

How to update a string in Python in existing data

I have below command output:
data = """
abcd11 11
abcd12 12
abcd13 13
abcd21 14
abcd22 15
abcd23 16
abcd101 17
abcd102 18
abcd103 19
... so on
abcd501 1
abcd502 2
"""
Condition 1: the numbers (strings, as per the data) must be in the range 1 to 255, i.e. must not exceed 255.
Code:
import sys

# Check abcd401, abcd402, abcd403
check = set()
for line in data.split("\n"):
    if len(line.split()) > 1:
        line = line.strip()
        check.add(line.split()[0])
if "abcd401" not in check and "abcd402" not in check and "abcd403" not in check:
    print "Not exist"
else:
    print "It already exists. Program exit"
    sys.exit()
Now I need to assign a number to abcd401, abcd402, abcd403, with the number between 1 and 255.
I can always assign abcd401 = 1, abcd402 = 2, abcd403 = 3, but I need to fill 1-255, then start again at 1-255, and so on. Please help.
I am trying to solve adding the line if it does not exist in the data (data is a multiline string input). This can be done easily with pandas, and the solution scales well if you use pandas. I did not focus on randomizing the number assigned to each line; I just cycle from 1 to 255 and, once the end is reached, start from 1 again. :) That part you can take care of...
My quick solution is:
import cStringIO as io
import pandas as pd
from itertools import cycle

text_data = """abcd11 11
abcd12 12
abcd13 13
abcd21 14
abcd22 15
abcd23 16
abcd101 17
abcd102 18
abcd103 19
abcd501 1
abcd502 2"""

def get_next_id():
    # cycle through 1..255 forever
    cyc = cycle(range(1, 256))
    for i in cyc:
        yield i

next_id = get_next_id()

def load_data():
    content = io.StringIO(text_data)
    df = pd.read_csv(content, header=None, sep=r"\s+", names=["txt", "num"])
    print "Content is loaded to pandas dataframe\n", df
    return df

def add_line_to_df(txt, df):
    idx = next_id.next()
    df2 = pd.DataFrame([[txt, idx]], columns=["txt", "num"])
    df.loc[len(df.index)] = df2.iloc[0]
    return df

def insert_valid_line(line, df):
    if line in pd.Series(df["txt"]).values:
        print "{}: already existed.".format(line)
    else:
        print "{}: adding to existing df.".format(line)
        add_line_to_df(line, df)

def main():
    df = load_data()
    new_texts = ["abcd501", "abcd502", "abcd402", "abcd403"]
    for txt in new_texts:
        print "-" * 20
        insert_valid_line(txt, df)
    print "-" * 20
    print df
    # at this point df is holding all the data

if __name__ == '__main__':
    main()
The output looks like this:
Content is loaded to pandas dataframe
txt num
0 abcd11 11
1 abcd12 12
2 abcd13 13
3 abcd21 14
4 abcd22 15
5 abcd23 16
6 abcd101 17
7 abcd102 18
8 abcd103 19
9 abcd501 1
10 abcd502 2
--------------------
abcd501: already existed.
--------------------
abcd502: already existed.
--------------------
abcd402: adding to existing df.
--------------------
abcd403: adding to existing df.
--------------------
txt num
0 abcd11 11
1 abcd12 12
2 abcd13 13
3 abcd21 14
4 abcd22 15
5 abcd23 16
6 abcd101 17
7 abcd102 18
8 abcd103 19
9 abcd501 1
10 abcd502 2
11 abcd402 1
12 abcd403 2
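For reference, a minimal Python 3 sketch of the same cycling-number idea (using the txt/num column names from above):
from itertools import cycle

next_id = cycle(range(1, 256))  # yields 1..255, then wraps around

def insert_if_missing(txt, df):
    # append txt with the next cycled number unless it is already present
    if txt not in df["txt"].values:
        df.loc[len(df)] = [txt, next(next_id)]
    return df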
