I am beginning to learn Python and am struggling with syntax.
I have a simple CSV file that looks like this:
0.01,10,20,0.35,40,50,60,70,80,90,100
2,22,32,42,52,62,72,82,92,102,112
3,33,43,53,63,5647,83,93,103,113,123
I want to find the highest and lowest values across all the data in the CSV file, ignoring the first value of each row.
So effectively the answer here would be
highestValue=5647
lowestValue=0.35
because the data that is considered is as follows (the first value of each row is ignored):
10,20,0.35,40,50,60,70,80,90,100
22,32,42,52,62,72,82,92,102,112
33,43,53,63,5647,83,93,103,113,123
I would like my code to work for ANY row length.
I really have to admit I'm struggling, but here's what I've tried. I usually program in PHP, so this is all new to me. I have been working on this simple task for a day and can't fathom it out. I think I'm getting confused by terminology, 'lists' for example.
import numpy

test_data_file = open("Anaconda3JamesData/james_test_3.csv", "r")
test_data_list = test_data_file.readlines()
test_data_file.close()

for record in test_data_list:
    all_values = record.split(',')
    maxvalue = numpy.max(numpy.asfarray(all_values[1:]))
    print(maxvalue)
With the test data (the CSV file shown at the very top of this question) I would expect the answer to be
highestValue=5647
lowestValue=0.35
If you're using numpy, you can read your CSV file as a numpy.ndarray using numpy.genfromtxt() and then use the array's .max() and .min() methods:
import numpy
array = numpy.genfromtxt('Anaconda3JamesData/james_test_3.csv', delimiter=',')
array[:, 1:].max()
array[:, 1:].min()
The [:, 1:] part uses numpy's array indexing. It says: take all the rows (the : before the comma), and for each row take all but the first column (the 1: after the comma). This doesn't work with Python's built-in lists.
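If the indexing is unfamiliar, here is a minimal sketch of the same slicing on a small in-memory array (made-up values):

import numpy

# toy 2-row array; made-up values
a = numpy.array([[0.01, 10.0, 20.0],
                 [2.0,  22.0, 32.0]])
print(a[:, 1:])        # all rows, columns 1 onward
print(a[:, 1:].max())  # 32.0
print(a[:, 1:].min())  # 10.0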
You're overwriting maxvalue each time through the loop, so you're just getting the max value from the last line, not the whole file. You need to compare with the previous maximum.
maxvalue = None
for record in test_data_list:
    all_values = record.split(',')
    if maxvalue is None:
        maxvalue = numpy.max(numpy.asfarray(all_values[1:]))
    else:
        maxvalue = max(maxvalue, numpy.max(numpy.asfarray(all_values[1:])))
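A complete sketch along these lines, tracking both extremes in one pass (assuming the same file layout as above):

import numpy

maxvalue = None
minvalue = None
with open("Anaconda3JamesData/james_test_3.csv") as f:
    for record in f:
        row = numpy.asfarray(record.split(',')[1:])
        maxvalue = row.max() if maxvalue is None else max(maxvalue, row.max())
        minvalue = row.min() if minvalue is None else min(minvalue, row.min())
print(maxvalue, minvalue)  # 5647.0 0.35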
You do not need the power of numpy for this problem. A simple CSV reader is good enough:
with open("Anaconda3JamesData/james_test_3.csv") as infile:
r = csv.reader(infile)
rows = [list(map(float, line))[1:] for line in r]
max(map(max, rows))
# 5647.0
min(map(min, rows))
# 0.35
I think using numpy is unneeded for this task. First of all, this:
test_data_file = open("Anaconda3JamesData/james_test_3.csv", "r")
test_data_list = test_data_file.readlines()
test_data_file.close()
for record in test_data_list:
can be simplified into this:
with open("Anaconda3JamesData/james_test_3.csv","r") as test_data_file:
for record in test_data_file:
We can use a list comprehension to read in all of the values:
with open("Anaconda3JamesData/james_test_3.csv","r") as test_data_file:
values = [float(val) for line in test_data_file for val in line.split(",")[1:]]
values now contains all relevant numbers, so we can just do:
highest_value = max(values)
lowest_value = min(values)
Here's a pandas solution that can give the desired results:
import pandas as pd
df = pd.read_csv('test1.csv', header=None)
# df:
# 0 1 2 3 4 5 6 7 8 9 10
# 0 0.01 10 20 0.35 40 50 60 70 80 90 100
# 1 2.00 22 32 42.00 52 62 72 82 92 102 112
# 2 3.00 33 43 53.00 63 5647 83 93 103 113 123
df = df.iloc[:, 1:]
print("Highest value: {}".format(df.values.max()))
print("Lowest value: {}".format(df.values.min()))
Output:
Highest value: 5647.0
Lowest value: 0.35
Let us say I have the following simple data frame. But in reality, I have hundreds of thousands of rows like this.
df
ID Sales
倀굖곾ꆹ譋῾理 100
倀굖곾ꆹ 50
倀굖곾ꆹ譋῾理 70
곾ꆹ텊躥㫆 60
My idea is to replace each Chinese-character ID with a randomly generated 8-digit number, so it looks something like below.
ID Sales
13434535 100
67894335 50
13434535 70
10986467 60
The digits are randomly generated, but they should keep uniqueness as well. For example, rows 0 and 2 are the same, so when the ID is replaced by a random unique one, it should be the same in both rows.
Can anyone help with this in Python pandas? Any solution that has already been done before is also welcome.
The primary method here will be to use Series.map() on the 'ID's to assign the new values.
Used for substituting each value in a Series with another value, that may be derived from a function, a dict or a Series.
which is exactly what you're looking for.
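As a toy illustration, here is a minimal sketch of Series.map() with a dict (made-up values):

import pandas as pd

s = pd.Series(['a', 'b', 'a'])
print(s.map({'a': 1, 'b': 2}).tolist())  # [1, 2, 1]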
Here are some options for generating the new IDs:
1. Randomly generated 8-digit integers, as asked
You can first create a map of randomly generated 8-digit integers with each of the unique ID's in the dataframe. Then use Series.map() on the 'ID's to assign the new values back. I've included a while loop to ensure that the generated ID's are unique.
import random

original_ids = df['ID'].unique()

while True:
    new_ids = {id_: random.randint(10_000_000, 99_999_999) for id_ in original_ids}
    if len(set(new_ids.values())) == len(original_ids):
        # all the generated id's were unique
        break
    # otherwise this will repeat until they are

df['ID'] = df['ID'].map(new_ids)
Output:
ID Sales
0 91154173 100
1 27127403 50
2 91154173 70
3 55892778 60
Edit & Warning: The original ids are Chinese characters and they are already length 8. There's definitely more than 10 Chinese characters so with the wrong combination of original IDs, it could become impossible to make unique-enough 8-digit IDs for the new set. Unless you are memory bound, I'd recommend using 16-24 digits. Or even better...
2. Use UUIDs. [IDEAL]
You can still use the "integer" version of the ID instead of hex. This has the added benefit of not needing to check for uniqueness:
import uuid
original_ids = df['ID'].unique()
new_ids = {cid: uuid.uuid4().int for cid in original_ids}
df['ID'] = df['ID'].map(new_ids)
(If you are okay with hex id's, change uuid.uuid4().int above to uuid.uuid4().hex.)
Output:
ID Sales
0 10302456644733067873760508402841674050 100
1 99013251285361656191123600060539725783 50
2 10302456644733067873760508402841674050 70
3 112767087159616563475161054356643068804 60
2.B. Smaller numbers from UUIDs
If the ID generated above is too long, you could truncate it, with some minor risk. Here, I'm only using the first 16 hex characters and converting those to an int. You may put that in the uniqueness loop check as done for option 1, above; a sketch combining the two follows the output below.
import uuid
original_ids = df['ID'].unique()
DIGITS = 16 # number of hex digits of the UUID to use
new_ids = {cid: int(uuid.uuid4().hex[:DIGITS], base=16) for cid in original_ids}
df['ID'] = df['ID'].map(new_ids)
Output:
ID Sales
0 14173925717660158959 100
1 10599965012234224109 50
2 14173925717660158959 70
3 13414338319624454663 60
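For instance, a sketch of the truncated-UUID version wrapped in the same uniqueness check as option 1 (assuming the same df as above):

import uuid

original_ids = df['ID'].unique()
DIGITS = 16  # number of hex digits of the UUID to use

while True:
    new_ids = {cid: int(uuid.uuid4().hex[:DIGITS], base=16) for cid in original_ids}
    if len(set(new_ids.values())) == len(original_ids):
        break  # all generated ids are unique

df['ID'] = df['ID'].map(new_ids)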
3. Creating a mapping based on the actual value:
This group of options has these advantages:
no uniqueness check is needed, since the new ID is derived deterministically from the original ID (so original IDs which were the same will generate the same new ID)
no map needs to be created in advance
3.A. CRC32
(Higher probability of finding a collision with different IDs, compared to option 2.B. above.)
import zlib
df['ID'] = df['ID'].map(lambda cid: zlib.crc32(bytes(cid, 'utf-8')))
Output:
ID Sales
0 2083453980 100
1 1445801542 50
2 2083453980 70
3 708870156 60
3.B. Python's built-in hash() of the original ID [My preferred approach in this scenario]
Can be done in one line, no imports needed
Reasonably unlikely to generate collisions for IDs which are different (note that string hashing in Python is randomized per process, so the generated IDs are consistent within one run but will differ between runs)
df['ID'] = df['ID'].map(hash)
Output:
ID Sales
0 4663892623205934004 100
1 1324266143210735079 50
2 4663892623205934004 70
3 6251873913398988390 60
3.C. MD5Sum, or anything from hashlib
Since the IDs are expected to be small (8 chars), even with MD5, the probability of a collision is very low.
import hashlib
DIGITS = 16 # number of hex digits of the hash to use
df['ID'] = df['ID'].str.encode('utf-8').map(lambda x: int(hashlib.md5(x).hexdigest()[:DIGITS], base=16))
Output:
ID Sales
0 17469287633857111608 100
1 4297816388092454656 50
2 17469287633857111608 70
3 11434864915351595420 60
I'm not very expert in Pandas, which is why I'm implementing a solution for you with NumPy + Pandas. As the solution uses fast NumPy operations, it will be much faster than a pure Python solution, especially if you have thousands of rows.
import pandas as pd, numpy as np

df = pd.DataFrame([
    ['倀굖곾ꆹ譋῾理', 100],
    ['倀굖곾ꆹ', 50],
    ['倀굖곾ꆹ譋῾理', 70],
    ['곾ꆹ텊躥㫆', 60],
], columns = ['ID', 'Sales'])

u, iv = np.unique(df.ID.values, return_inverse = True)

while True:
    ids = np.random.randint(10 ** 7, 10 ** 8, u.size)
    if np.all(np.unique(ids, return_counts = True)[1] <= 1):
        break

df.ID = ids[iv]
print(df)
Output:
ID Sales
0 31043191 100
1 36168634 50
2 31043191 70
3 17162753 60
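The key step is np.unique(..., return_inverse=True): u holds the sorted unique IDs and iv is an index array such that u[iv] reconstructs the original column, so ids[iv] substitutes each original ID with its new random ID. A tiny illustration (made-up values):

import numpy as np

a = np.array(['b', 'a', 'b'])
u, iv = np.unique(a, return_inverse=True)
print(u)      # ['a' 'b']
print(iv)     # [1 0 1]
print(u[iv])  # ['b' 'a' 'b']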
Given a dataframe df, create a list of the ids:
id_list = list(df.ID)
Then import the random package
from random import randint
from collections import deque

def idSetToNumber(id_list):
    id_set = deque(set(id_list))
    checked_numbers = []
    while len(id_set) > 0:
        # generate a candidate 8-digit id
        id = randint(10000000, 99999999)
        # check if the id has been used
        if id not in checked_numbers:
            checked_numbers.append(id)
            id_set.popleft()
    return checked_numbers
This gives a list of unique 8-digit numbers, one for each of your unique keys.
Then create a dictionary. Note that we must pair the generated numbers with the unique IDs, not with id_list directly, since id_list contains duplicates:

checked_numbers = idSetToNumber(id_list)
unique_ids = list(set(id_list))  # same contents and in-process order as the set built inside idSetToNumber
name2id = {}
for i in range(len(checked_numbers)):
    name2id[unique_ids[i]] = checked_numbers[i]
Last step, replace all the pandas ID fields with the ones in the dictionary (using .loc to avoid pandas' chained-assignment warning):

for i in range(df.shape[0]):
    df.loc[i, 'ID'] = str(name2id[df.loc[i, 'ID']])
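For reference, the same replacement can also be sketched as a one-liner with Series.map (same name2id dict as above):

df['ID'] = df['ID'].map(name2id).astype(str)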
I would:
identify the unique ID values
build (from np.random) an array of unique values of the same size
build a transformation dataframe with that array
use merge to replace the original ID values
Possible code:

import numpy as np

trans = df[['ID']].drop_duplicates()  # unique ID values
n = len(trans)
# np.random.seed(0)  # uncomment for reproducible pseudo-random sequences

while True:
    # draw a larger array to have a higher chance of getting enough unique values
    arr = np.unique(np.random.randint(10000000, 100000000, n + n // 2))
    if len(arr) >= n:
        arr = arr[:n]  # ok, keep only the required number
        break

trans['new'] = arr  # ok, we have our transformation table
df['ID'] = df.merge(trans, how='left', on='ID')['new']  # done...
With your sample data (and with np.random.seed(0)), it gives:
ID Sales
0 12215104 100
1 48712131 50
2 12215104 70
3 70969723 60
Per @Arty's comment, np.unique will return an ascending sequence. If you do not want that, shuffle it before using it for the transformation table:
...
np.random.shuffle(arr)
trans['new'] = arr
...
so my df looks like this:
x y group
0 53 10 csv1
1 53 10 csv1
2 48 9 csv0
3 48 9 csv0
4 48 9 csv0
... ... ... ...
I have some files whose names depend on the group name, and I want to use them in a function along with the x and y values.
what I am doing so far is the following:
dfGrouped = df.groupby('group')  # group the dataframe
df['newcol'] = np.nan  # create new empty col

# use a for loop to load the file depending on the group; note the file is
# very large, that's why I want to load it only once per group
for name, group in dfGrouped:
    file = open(name + '.txt')  # open the file
    df['newcol'] = df[df['group'] == name].apply(lambda row: newValueFromFile(row.x, row.y, file), axis=1)
It seemed to work at first; unfortunately, newcol only holds the values from the last loop iteration and seems to override the values created earlier with NaN. Does anybody have an idea?
instead of file = open, use
with open('filename.txt', 'a') as file:
and then use file.write... in the lambda expression.
The 'a' in the open call tells it to append the data to the existing file. I guess currently you are overwriting the content of the file.
with open() also takes care of automatically closing the file after you're done with it.
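For reference, a minimal sketch of one way to fix the override in the original loop (assuming the hypothetical newValueFromFile from the question): assign through .loc to only the matching rows, so each group's values are preserved instead of resetting the whole column on every iteration.

for name, group in dfGrouped:
    with open(name + '.txt') as file:
        mask = df['group'] == name
        # write only into the rows of this group; other rows keep their values
        df.loc[mask, 'newcol'] = df.loc[mask].apply(
            lambda row: newValueFromFile(row.x, row.y, file), axis=1)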
Say I have in memory a large file, loaded using chunksize in pandas. Now I have to compare every value with the ones adjacent to it. My problem is that I can't seem to select at the same time the extreme values (in first and last position) of two different chunks.
Example:
print(df)
a
0 102
1 101
2 104
3 110
4 104
5 105
count = 0
for i in range(len(df)-1):
    if df.iloc[i+1]['a'] > df.iloc[i]['a']:
        count += 1
count would be equal to 3 in this example. But say I have loaded df from a .csv with chunksize=1, how would I achieve a similar result, considering that values will be in different chunks? In practice chunksize is 10000 and so the problem would be limited to the first and last value for each chunk.
EDIT:
Here is an example where I store last_chunk_value to carry the boundary value over to the next loop iteration.
I've tested a 'brute force' method to compare with the 'chunk script'; the results are the same with both methods.
By the way, I've simplified the 'brute force' method.
import pandas as pd
import numpy as np
import random

# 'data' generation as csv file
file = open("data.csv", 'w')
file.write('rand_int' + '\n')
for i in range(0, 10000):
    file.write(str(random.randint(80, 120)) + '\n')
file.close()

# "brute force" method
df = pd.read_csv("data.csv")
length = int((df['rand_int'].shift(-1) - df['rand_int'] > 0).sum())
print('number=', length)

# chunksize method
chunksize = 33
length = 0
last_chunk_value = np.nan
for chunk in pd.read_csv("data.csv", chunksize=chunksize):
    chunk['shift'] = chunk['rand_int'].shift(1)
    chunk.iloc[0, 1] = last_chunk_value  # patch in the last value of the previous chunk
    length += (chunk['rand_int'] - chunk['shift'] > 0).sum()
    last_chunk_value = chunk.iloc[-1, 0]
print('number=', length)
Using Python, how do I break a text file into dataframes where every 84 rows is a new, different dataframe? The first column, x_ft, has the same value for every 84 rows, then increments by 5 ft for the next 84 rows. I need each identical x_ft value and the corresponding values in the other two columns (depth_ft and vel_ft_s) to be in the new dataframe too.
My text file is formatted like this:
x_ft depth_ft vel_ft_s
0 270 3535.755 551.735107
1 270 3534.555 551.735107
2 270 3533.355 551.735107
3 270 3532.155 551.735107
4 270 3530.955 551.735107
.
.
33848 2280 3471.334 1093.897339
33849 2280 3470.134 1102.685547
33850 2280 3468.934 1113.144287
33851 2280 3467.734 1123.937134
I have tried many, many different ways but keep running into errors and would really appreciate some help.
I suggest looking into pandas.read_table, which automatically outputs a DataFrame. Once you have done so, you can isolate the rows of the DataFrame that you are looking to separate (every 84 rows) by doing something like this:
import pandas as pd

df = pd.read_table("data.txt", sep=r"\s+")  # read the txt data table with pandas; hypothetical filename

# This gives you an array of all x values in your dataset
arr = []
for x in range(0, 403):
    val = 270 + 5 * x
    arr.append(val)

# This generates a csv file for every group of rows with a specific x_ft
# value, with its corresponding columns (depth_ft and vel_ft_s)
for x_value in arr:
    tempdf = df[df['x_ft'] == x_value]
    tempdf.to_csv("df" + str(x_value) + ".csv")
You can get indexes to split your data:

rows = 84
datasets = round(len(data) / rows)  # total datasets
index_list = []
for index in data.index:
    x = index % rows
    if x == 0:
        index_list.append(index)
print(index_list)

So, split the original dataset by those indexes (the end boundary must be len(data), not max(index_list) + 1, so the last dataset keeps all of its rows):

l_mod = index_list + [len(data)]
dfs_list = [data.iloc[l_mod[n]:l_mod[n+1]] for n in range(len(l_mod)-1)]
print(len(dfs_list))
Outputs
print(type(dfs_list[1]))
# pandas.core.frame.DataFrame
print(len(dfs_list[0]))
# 84
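For reference, a compact alternative sketch: with a default integer index, grouping by index // 84 yields the same 84-row blocks (toy data below; made-up values):

import pandas as pd
import numpy as np

# toy frame: 168 rows => two 84-row groups
df = pd.DataFrame({'x_ft': np.repeat([270, 275], 84),
                   'depth_ft': np.arange(168.0),
                   'vel_ft_s': np.arange(168.0)})
dfs_list = [group for _, group in df.groupby(df.index // 84)]
print(len(dfs_list), len(dfs_list[0]))  # 2 84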
I have a large text file containing several million lines of data. The very first column contains position coordinates. I need to create another file from this original data, but that only contains specified non-contiguous intervals based on the position coordinates. I have another file containing the coordinates for each interval. For instance, my original file is in a format similar to this:
Position Data1 Data2 Data3 Data4
55 a b c d
63 a b c d
68 a b c d
73 a b c d
75 a b c d
82 a b c d
86 a b c d
Then let's say I have my file containing intervals that looks something like this...
name1 50 72
name2 78 93
Then I want my new file to look something like this...
Position Data1 Data2 Data3 Data4
55 a b c d
63 a b c d
68 a b c d
82 a b c d
86 a b c d
So far I have created a function to write the data from the original file contained within a specific interval to my new file. My code is as follows:
def get_block(beg, end):
    output = open(output_table, 'a')
    with open(input_table, 'r') as f:
        for line in f:
            line = line.strip("\r\n")
            line = line.split("\t")
            position = int(line[0])
            if position <= beg:
                pass
            elif position >= end:
                break
            else:
                for i in line:
                    output.write(("%s\t") % (i))
                output.write("\n")
    output.close()
I then create a list containing the pairs of my intervals and then loop through my original file using the above function like this:
#coords=[[start1,stop1],[start2,stop2],[start3,stop3]..etc]
for i in coords:
start_p=int(i[0]) ; stop_p=int(i[1])
get_block(start_p,stop_p)
This performs what I want; however, it gets progressively slower as it moves along my coordinate list, because I have to read through my entire file from the start until I reach the specified start coordinate on each pass through the loop. Is there a more efficient way of accomplishing this? Is there a way to skip to a specific line each time instead of reading over every line?
Thanks for the suggestions to use pandas. Previously, my original code had been running for about 18 hours and was only halfway finished. Using pandas, it created my desired file in under 5 minutes. For future reference, and in case anyone else has a similar task, here is the code that I used.
import pandas as pd

data = pd.read_csv(input_table, delimiter="\t")
for i in coords:
    start_p = int(i[0]); stop_p = int(i[1])
    df = data[(data.POSITION >= start_p) & (data.POSITION <= stop_p)]
    df.to_csv(output_table, index=False, sep="\t", header=False, mode='a')
I'd just use the built-in csv module to simplify reading the input. To further speed things up, all the coord ranges could be read in at once, which would allow the selection process to occur in one pass through the data file.
import csv

# read all coord ranges into memory
with open('ranges', newline='') as ranges:
    range_reader = csv.reader(ranges, delimiter='\t')
    coords = [(int(start), int(stop)) for name, start, stop in range_reader]

# make one pass through the input file and extract the positions specified
with open('output_table', 'w', newline='') as outf, open('input_table', newline='') as inf:
    input_reader = csv.reader(inf, delimiter='\t')
    outf.write('\t'.join(next(input_reader)) + '\n')  # copy header row
    for row in input_reader:
        for start, stop in coords:
            if start <= int(row[0]) <= stop:
                outf.write('\t'.join(row) + '\n')
                break
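If the number of ranges grows large, the per-row linear scan over coords can itself become a bottleneck. A sketch of one way to speed that up (assuming the ranges are sorted and non-overlapping) using the bisect module:

import bisect

# hypothetical sorted, non-overlapping (start, stop) ranges
coords = sorted([(50, 72), (78, 93)])
starts = [start for start, stop in coords]

def in_any_range(pos):
    # locate the rightmost range whose start is <= pos, then check its stop
    i = bisect.bisect_right(starts, pos) - 1
    return i >= 0 and pos <= coords[i][1]

print(in_any_range(55), in_any_range(75))  # True False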