I am learning Python.
For the code below, how can I convert the for loops to while loops in an efficient way?
import pandas as pd

transactions01 = []
file = open('raw-data1.txt', 'w')
file.write('HotDogs,Buns\nHotDogs,Buns\nHotDogs,Coke,Chips\nChips,Coke\nChips,Ketchup\nHotDogs,Coke,Chips\n')
file.close()

file = open('raw-data1.txt', 'r')
lines = file.readlines()
for line in lines:
    items = line[:-1].split(',')  # strip the trailing newline, then split on commas
    has_item = {}
    for item in items:
        has_item[item] = 1
    transactions01.append(has_item)
file.close()

data = pd.DataFrame(transactions01)
data.fillna(0, inplace=True)
data
Code:

i = 0
while i < len(lines):
    items = lines[i][:-1].split(',')
    has_item = {}
    j = 0
    while j < len(items):
        has_item[items[j]] = 1
        j += 1
    transactions01.append(has_item)
    i += 1
It looks like you could use the csv module to parse the file (since you've got an inconsistent number of columns per row), turn it into a dataframe, use pd.get_dummies to get 0/1 flags per item present, then aggregate back to row level to produce your final output, e.g.:
import pandas as pd
import csv
with open('raw-data1.txt') as fin:
    df = pd.get_dummies(pd.DataFrame(csv.reader(fin)).stack()).groupby(level=0).max()
Will give you a df of:
Buns Chips Coke HotDogs Ketchup
0 1 0 0 1 0
1 1 0 0 1 0
2 0 1 1 1 0
3 0 1 1 0 0
4 0 1 0 0 1
5 0 1 1 1 0
.. which you can then write back out as CSV if required.
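For instance, a minimal write-out sketch ('transactions.csv' is just a placeholder name, continuing from the snippet above):

df.to_csv('transactions.csv', index=False)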
I have an excel sheet which consists of 2 columns: the first is keywords and the second is Url.
I am making a script to extract groups which share the same 3 URLs or more.
I wrote the code below, but it takes around an hour to process the main function on a huge excel sheet.
import pandas as pd
import numpy as np
import time

loop = 1
numerator = 0
continuee = []
df_list = []

for index in list(df.sort_values('Url').set_index('Url').index.unique()):
    if len(df.sort_values('Url').set_index('Url').loc[index].values) == 1:
        list1 = list(df.sort_values('Url').set_index('Url').loc[index].values)
    elif len(df.sort_values('Url').set_index('Url').loc[index].keywords.values) > 1:
        list1 = list(df.sort_values('Url').set_index('Url').loc[index].keywords.values)
    df1 = df[df.keywords.isin(list1)]
    df1 = df1[df1.Url.duplicated(keep=False)]
    df1 = df1.groupby('Url').filter(lambda x: x.Url.value_counts() == df1.keywords.nunique())
    df1 = df1.groupby('keywords').filter(lambda x: x.keywords.value_counts() >= 3)
    df1 = df1.groupby('Url').filter(lambda x: x.Url.value_counts() == df1.keywords.nunique())
    if df1.keywords.nunique() > 1:
        silos = list(df1.keywords.unique())
        df_list.append({numerator: silos})
        word = word[~(word.isin(silos))]
        numerator += 1
    else:
        singles = list(word[word.keywords.isin(list1)].keywords.unique())
        df_list.append({"single": singles})
        word = word[~(word.isin(singles))]
    print(loop)
    loop += 1

trial = pd.DataFrame(df_list)
if 'single' in list(trial.columns):
    for i in list(word.keywords.unique()):
        if i not in list(trial.single):
            df_list.append({"single": i})
else:
    for i in list(word.keywords.unique()):
        df_list.append({"single": i})

trial = pd.DataFrame(df_list)
I tried many times to use multiprocessing, but I failed as I don't really understand how it works with pandas. Is there a way you could help me, please? Also, if I wanted to pass in a couple of other functions, how would I do that? Many thanks in advance.
From what I can gather, this should be your solution:
by_size = df.groupby(df.columns.tolist()).size().reset_index()
three_or_more = by_size[by_size[0] >= 3].iloc[:, :-1]
Example:
>>> df
keyword url
0 2 2
1 4 3
2 2 1
3 4 3
4 1 1
5 2 1
6 4 1
7 2 1
8 1 1
9 3 3
>>> by_size = df.groupby(df.columns.tolist()).size().reset_index()
>>> by_size
keyword url 0
0 1 1 2
1 2 1 3
2 2 2 1
3 3 3 1
4 4 1 1
5 4 3 2
>>> three_or_more=by_size[by_size[0]>=3].iloc[:,:-1]
>>> three_or_more
keyword url
1 2 1
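If you then need the original rows belonging to those groups, a follow-up sketch (just a standard merge on the two columns from the example; this is an addition, not part of the original answer):

matching = df.merge(three_or_more, on=['keyword', 'url'])  # rows of df whose (keyword, url) pair occurs 3+ times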
There are several files like this:
sample_a.txt containing:
a
b
c
sample_b.txt containing:
b
w
e
sample_c.txt containing:
a
m
n
I want to make a matrix of absence/presence like this:
a b c w e m n
sample_a 1 1 1 0 0 0 0
sample_b 0 1 0 1 1 0 0
sample_c 1 0 0 0 0 1 1
I know a dirty and dumb way to solve it: make up a list of all possible letters in those files, and then, iteratively comparing each line of each file with this 'library', fill in the final matrix by index. But I guess there's a smarter solution. Any ideas?
Update:
the sample files can be of different lengths.
You can try:
import pandas as pd
from collections import defaultdict
dd = defaultdict(list) # dictionary where each value per key is a list
files = ["sample_a.txt","sample_b.txt","sample_c.txt"]
for file in files:
    with open(file, "r") as f:
        for row in f:
            dd[file.split(".")[0]].append(row[0])
            # appending to dictionary dd:
            # KEY: file.split(".")[0] is the file name without the extension
            # VALUE: row[0] is the first character of the line in the text file
            #        (the second character was the newline '\n', so I removed it)

df = pd.DataFrame.from_dict(dd, orient='index').T.melt()  # convert the dictionary to a long-format dataframe
pd.crosstab(df.variable, df.value)  # make a crosstab, similar to pd.pivot_table
result:
value     a  b  c  e  m  n  w
variable
sample_a  1  1  1  0  0  0  0
sample_b  0  1  0  1  0  0  1
sample_c  1  0  0  0  1  1  0
Please note letters (columns) are in alphabetical order.
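If you would rather keep the columns in first-appearance order, a small sketch (my own addition, reindexing the same crosstab with standard pandas):

ct = pd.crosstab(df.variable, df.value)
order = df.value.dropna().drop_duplicates().tolist()  # letters in first-appearance order
ct = ct[order]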
I have a pandas dataframe and I want to loop over the last column "n" times based on a condition.
import random as random
import pandas as pd

p = 0.5
df = pd.DataFrame()
start = []
for i in range(5):
    if random.random() < p:
        start.append("0")
    else:
        start.append("1")
df['start'] = start
print(df['start'])
Essentially, I want to loop over the final column "n" times and if the value is 0, change it to 1 with probability p so the results become the new final column. (I am simulating on-off every time unit with probability p).
e.g. after one iteration, the dataframe would look something like:
0 0
0 1
1 1
0 0
0 1
after two:
0 0 1
0 1 1
1 1 1
0 0 0
0 1 1
What is the best way to do this?
Sorry if I am asking this wrong; I have been trying to google a solution for hours and coming up empty.
Like this: append columns named 1, 2, ...
# continue from the question code ...
# column names are 1, 2, ...
for col in range(1, 5):
    tmp = []
    for i in range(5):
        # check the final column
        if df.iloc[i, col - 1] == "0":
            if random.random() < p:
                tmp.append("0")
            else:
                tmp.append("1")
        else:  # == "1"
            tmp.append("1")
    # append the new column
    df[str(col)] = tmp
print(df)
# initial
  start
0     0
1     1
2     0
3     0
4     0
# result
  start  1  2  3  4
0     0  0  1  1  1
1     0  0  0  0  1
2     0  0  1  1  1
3     1  1  1  1  1
4     0  0  0  0  0
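As a side note, the same simulation can be vectorized with numpy so each step is a single array operation rather than a Python loop. This is my own sketch, not the answerer's code; it follows the question's statement (a 0 flips to 1 with probability p) and assumes the column holds the strings "0"/"1" as above:

import numpy as np

col = df['start'].astype(int).to_numpy()
for step in range(1, 5):
    flips = (np.random.random(len(col)) < p).astype(int)  # 1 with probability p
    col = np.maximum(col, flips)  # a 0 flips to 1; a 1 stays 1
    df[str(step)] = col.astype(str)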
I have a text file which looks like this:
~Date and Time of Data Converting: 15.02.2019 16:12:44
~Name of Test: XXX
~Address: ZZZ
~ID: OPP
~Testchannel: CH06
~a;b;DateTime;c;d;e;f;g;h;i;j;k;extract;l;m;n;o;p;q;r
0;1;04.03.2019 07:54:19;0;0;2;Pause;3,57263521596443;0;0;0;0;24,55957;1;3;0;0;0;0;0
5,5523894132E-7;2;04.03.2019 07:54:19;5,5523894132E-7;5,5523894132E-7;2;Pause;3,57263521596443;0;0;0;0;24,55957;1;0;0;0;0;0;0
0,00277777777779538;3;04.03.2019 07:54:29;0,00277777777779538;0,00277777777779538;2;Pause;3,5724446855812;0;0;0;0;24,55653;1;1;0;0;0;0;0
0,00555555532278617;4;04.03.2019 07:54:39;0,00555555532278617;0,00555555532278617;2;Pause;3,57263521596443;0;0;0;0;24,55957;1;1;0;0;0;0;0
0,00833333333338613;5;04.03.2019 07:54:49;0,00833333333338613;0,00833333333338613;2;Pause;3,57263521596443;0;0;0;0;24,55653;1;1;0;0;0;0;0
0,0111112040002119;6;04.03.2019 07:54:59;0,0111112040002119;0,0111112040002119;2;Pause;3,57263521596443;0;0;0;0;24,55653;1;1;0;0;0;0;0
0,013888887724954;7;04.03.2019 07:55:09;0,013888887724954;0,013888887724954;2;Pause;3,57263521596443;0;0;0;0;24,55653;1;1;0;0;0;0;0
I need to extract the values from the column named extract, and need to store the output as an excel file.
Can anyone give me any idea how I can proceed?
So far, I have only been able to create an empty excel file for the output, and I have read the text file. I however don't know how to append output to the empty excel file.
import os
import numpy as np

file = open('extract.csv', "a")
if os.path.getsize('extract.csv') == 0:
    file.write(" " + ";" + "Datum" + ";" + "extract" + ";")
with open('myfile.txt') as f:
    dat = [f.readline() for x in range(10)]
datum = dat[7].split(' ')[3]
data = np.genfromtxt('myfile.txt', delimiter=';', skip_header=12, dtype=str)
You can use the pandas module.
You need to skip the first lines of your text file. Here, I assume we don't know how many there are, so I loop until I find a data row.
Then read the data.
Finally, export the dataframe with to_excel (doc).
Here is the code:
# Import module
import pandas as pd

# Read file
with open('temp.txt') as f:
    content = f.read().split("\n")

# Skip the first lines (find where the data starts)
for i, line in enumerate(content):
    if line and line[0] != '~':
        break

# Column names and data
header = content[i - 1][1:].split(';')  # the line before the data, minus its leading '~'
data = [row.split(';') for row in content[i:]]

# Store in dataframe
df = pd.DataFrame(data, columns=header)
print(df)
# a b DateTime c d e f ... l m n o p q r
# 0 0 1 04.03.2019 07:54:19 0 0 2 Pause ... 1 3 0 0 0 0 0
# 1 5,5523894132E-7 2 04.03.2019 07:54:19 5,5523894132E-7 5,5523894132E-7 2 Pause ... 1 0 0 0 0 0 0
# 2 0,00277777777779538 3 04.03.2019 07:54:29 0,00277777777779538 0,00277777777779538 2 Pause ... 1 1 0 0 0 0 0
# 3 0,00555555532278617 4 04.03.2019 07:54:39 0,00555555532278617 0,00555555532278617 2 Pause ... 1 1 0 0 0 0 0
# 4 0,00833333333338613 5 04.03.2019 07:54:49 0,00833333333338613 0,00833333333338613 2 Pause ... 1 1 0 0 0 0 0
# 5 0,0111112040002119 6 04.03.2019 07:54:59 0,0111112040002119 0,0111112040002119 2 Pause ... 1 1 0 0 0 0 0
# 6 0,013888887724954 7 04.03.2019 07:55:09 0,013888887724954 0,013888887724954 2 Pause ... 1 1 0 0 0 0 0
# Select only the extract column
# df = df['extract']
# Save the data in excel file
df.to_excel("OutPut.xlsx", "MySheetName", index=False)
Note: if you know the number of lines to skip, you can simply load the dataframe with read_csv using the skiprows parameter. (doc).
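A sketch of that shortcut for the sample file above (skiprows=5 and the '~a' rename are assumptions based on that sample; decimal=',' handles the comma decimal marks):

import pandas as pd

df = pd.read_csv('temp.txt', sep=';', skiprows=5, decimal=',')
df = df.rename(columns={'~a': 'a'})  # the header line keeps its leading '~'
df['extract'].to_excel('OutPut.xlsx', 'MySheetName', index=False)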
Hope that helps!
I am quite new to programming (Python) and I am trying to write a script in Python that compares the values in two separate files such that if the value is the same it assigns 0, and if the value is different it assigns 1.
Say both initial files are 4 rows by 3 columns, so the final file will be a 4-rows-by-3-columns file of just 1's and 0's.
Also, I'd like to sum all the values in this new file (that is, summing all the 1's together).
I have checked around, and I have come across modules such as difflib; however, I don't know if that'll be suitable.
I am wondering if anyone can help out with something simple...
Thanks a lot in advance :)
Both files shown below consist of 5 rows and 6 columns.
File 1 (ain.txt)
0 1 0 1 0 0
0 0 0 0 0 0
0 1 0 1 0 0
0 0 0 0 0 0
0 1 0 1 0 0
File 2 (bin.txt)
1 1 1 1 1 0
1 1 1 1 1 0
1 1 1 1 1 0
1 1 1 1 1 0
1 1 1 1 1 0
The script below outputs True and False...
import numpy as np
infile = np.loadtxt('ain.txt')
data = np.array(infile)
infile1 = np.loadtxt('bin.txt')
data1 = np.array(infile1)
index = (data==data1)
np.savetxt('comparrr.txt', (index), delimiter = ' ', fmt='%s')
The output shown below:
comparrr.txt
FALSE TRUE FALSE TRUE FALSE TRUE
FALSE FALSE FALSE FALSE FALSE TRUE
FALSE TRUE FALSE TRUE FALSE TRUE
FALSE FALSE FALSE FALSE FALSE TRUE
FALSE TRUE FALSE TRUE FALSE TRUE
However, I would want the "FALSE" to be represented by values of 1, and the "TRUE" by values of 0.
I hope this clarifies my question.
Thanks very much in advance.
Sorry for all the trouble; I found out the issue with the previous script was the format I chose (fmt='%s'). Changing that to fmt='%d' gives the output as 1's and 0's; however, I want them flipped (i.e., the 1's become 0's and the 0's become 1's).
Thanks
The output after the change in format mentioned above, shown below:
0 1 0 1 0 1
0 0 0 0 0 1
0 1 0 1 0 1
0 0 0 0 0 1
0 1 0 1 0 1
EDIT: Ok, updating answer
You don't need to import numpy to solve this problem.
If you open the files and iterate over them, they will be read line by line as strings. You can use split() to turn each line into a list, then use zip() and list comprehensions to quickly figure out whether the values are equal or not. Then you can turn each row back into a string (with map() and join()) and write it to the file.
foo1 = iter(open('foo1', 'r'))
foo2 = iter(open('foo2', 'r'))
outArr = [[0 if p == q else 1 for p, q in zip(i.split(), j.split())] for i, j in zip(foo1, foo2)]
totalSum = sum(sum(row) for row in outArr)

with open('outFile', 'w') as out:
    for row in outArr:
        out.write(' '.join(map(str, row)) + '\n')
In regards to your code: while the index = (data==data1) bit technically works because of how numpy arrays work, it isn't very readable in my opinion.
To invert your array, numpy provides invert, which can be applied directly to the numpy array as np.invert(index). Also, np.loadtxt() already returns an np.ndarray, so you don't need to reassign it. To make your code work as you have outlined, I would do the following...
import numpy as np
infile = np.loadtxt('foo1')
infile1 = np.loadtxt('foo2')
index = np.invert(infile==infile1).astype(int)
totalSum = sum(sum(index))
np.savetxt('outFile', index, fmt='%d')
'''
assume file 'a.txt' is:
1 2 3
4 5 6
7 8 9
10 11 12
'''
# 1. Read in the two files.
with open('a.txt', 'r') as fa:
    a = [list(map(int, line.split())) for line in fa]
with open('b.txt', 'r') as fb:
    b = [list(map(int, line.split())) for line in fb]

# 2. Compare the values in the two files (0 if equal, 1 if different).
sum_value = 0
c = []
for i in range(4):
    c.append([])
    for j in range(3):
        if a[i][j] == b[i][j]:
            c[i].append(0)
        else:
            c[i].append(1)
            sum_value += 1

# 3. Print the comparison array.
print(c)

# 4. Print the sum value.
print(sum_value)