I'm using the Python packages xlrd and xlwt to read and write Excel spreadsheets from Python, but I can't figure out how to write the code to solve my problem.
My data consists of a column of state abbreviations and a column of numbers, 1 through 7. There are about 200-300 entries per state, and I want to figure out how many ones, twos, threes, and so on exist for each state. I'm struggling with what method I'd use to figure this out.
Normally I would post the code I already have, but I don't even know where to begin.
1. Prepare a dictionary to store the results.
2. Get the number of rows with data using xlrd, then iterate over each of them.
3. For each state code, if it's not in the results dict, create it as a (nested) dict.
4. Then check whether the entry read from the second column already exists under that state's key in your results dict.
   4.1 If it does not, add the number found in the second column as a key of that state's dict, with a value of one.
   4.2 If it does, just increment the value for that key (+1).
5. Once the loop has finished, your results dict will hold the count for each individual entry in each individual state (see the sketch below).
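A minimal sketch of those steps, assuming the data sits in the first sheet of "data.xls" with the state in column 0 and the number in column 1, and no header row (all of those details are assumptions, not part of the answer):

import xlrd

book = xlrd.open_workbook("data.xls")   # assumed file name
sheet = book.sheet_by_index(0)          # assumed: first sheet, no header row

results = {}
for row in range(sheet.nrows):
    state = sheet.cell_value(row, 0)
    number = int(sheet.cell_value(row, 1))
    if state not in results:            # step 3: create the per-state dict
        results[state] = {}
    if number not in results[state]:    # step 4.1: first time this number is seen
        results[state][number] = 1
    else:                               # step 4.2: increment the existing count
        results[state][number] += 1

print(results)   # e.g. {'CA': {1: 12, 2: 9, ...}, 'AZ': {...}, ...}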
I'm going to assume you already know how to do the easy part of this and read a spreadsheet into Python as a list of lists. So, you've got something like this:
data = [['CA', 1],
        ['AZ', 2],
        ['NM', 3],
        ['CA', 2]]
Now, what you want is, for each state and each number, a count of how many times that number appears. So:
import collections

counts = {}
for state, number in data:
    counts.setdefault(state, collections.Counter())[number] += 1
I am not sure how to fix this. This is the code I want, but I do not want it to continuously repeat the names of the rows in the output.
I'd suggest a few changes to your code.
Firstly, to answer your question, you can remove the multiple occurrences of the words by using:
select_merch = d.loc[df['Category'] == 'Merchandise'].sum()['Cost']
This will make sure to select only the sum of the Cost column for a particular DataFrame. Also, this code is quite redundant and confusing. What you can do instead is create a list of categories and iterate over it.
list(df['Category'].unique()) will give you a list of all the unique categories. Store it in a list and then iterate over it. Plus, you don't need to do d = pd.DataFrame(df) every time; you can use df itself.
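For instance, here is a hedged sketch of that loop-over-categories idea; the CSV file name and the setup of df are my assumptions, and only the 'Category' and 'Cost' column names come from the snippets above:

import pandas as pd

df = pd.read_csv('expenses.csv')   # placeholder: in the question df already exists

totals = {}
for category in df['Category'].unique():
    totals[category] = df.loc[df['Category'] == category, 'Cost'].sum()

print(totals)   # e.g. {'Merchandise': 1234.5, 'Food': 87.2, ...}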
I have a problem. I have code that uses selenium to get information from different sites and put it into one list. After each pass, Python deletes all the information in the list, so I need to write it to Excel first:
List = []
for values in List:
    ...
    List.append(values)
    List.append(some_information_from_selenium)
And at the end of each iteration:
List.clear()
I need to save the information before clear(), and after clearing the list, add the new information to the Excel file. The loop runs 100 times: the list is cleared and then refilled with new information on each iteration, so I need to create a new Excel file and keep adding to it. I will end up with 18 columns and 100 rows. I can use whatever I want.
UPD:
One more question: if I use
data = pd.DataFrame()
data({ "Name":List[some_index]
"Surname":List[some_index_1]
.... })
data.to_excel("Excel.xlsx")
why do I get the error 'DataFrame' object is not callable, and how can I solve it?
I'm not 100% sure what you're trying to do from your code, but instead of clearing your list variable each time, let's hold it in some sort of nested collection.
A simple dictionary will do.
from collections import defaultdict
data_dict = defaultdict(list)
for num in range(10, 110, 10):  # call your func in iterations of 10s
    # ... your selenium code for this batch ...
    data_dict[num].append(some_information_from_selenium)
Each iteration of 10 will hold your nested data:
data_dict[10]
which you can then pass into pandas.
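If it helps, here is a rough sketch of that last step; it assumes each appended item is a row-like list of values, and the column names are placeholders taken from the question, not from the answer. (As for the UPD error: a DataFrame instance is not callable, so data({...}) raises exactly that exception; the dictionary has to go into the constructor, pd.DataFrame({...}).)

import pandas as pd

rows = []
for batch in data_dict.values():
    rows.extend(batch)                      # flatten the batches into one list of rows

columns = ['Name', 'Surname']               # placeholders; list all 18 of your columns here
df = pd.DataFrame(rows, columns=columns)    # assumes each row has one value per column
df.to_excel('Excel.xlsx', index=False)      # needs openpyxl or xlsxwriter installed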
I am looking to assess if there is a better method to append to a list within a list within a dictionary.
I have many different packets and associated strings to search for in a huge text file. Associated to each string is a value I want to store in a list so that I can perform calculations like average/max/min.
Due to the packet variations and the strings associated with each packet, I was looking to keep each dictionary entry to a single line. So I would have a key as the packet ID and the value as a list of elements, see below:
mycompactdict = {
    "packetID_001": [12, 15, 'ID MATCH', [['search_string', []], ['search_string2', []]]],
    "packetID_002": [...etc]
}
The 12,15 ints are references I use later in Excel plotting. The 'ID_MATCH' entry is my first check to see if the packet ID matches the file object. The 'search_string' references are the strings I am looking for and the blank lists next to them is where I hope to drop the values associated to each search string after splitting the line in the text file.
Now I may be biting off more than Python can chew... I realize there is a list within a list within a list within a list within a dict!
Here's a start of my code...
def process_data(file_object):
    split_data = file_object.split('\n')
    for key in mycompactdict:
        if mycompactdict[key][2] in file_object:
            for line in split_data:
                for item in mycompactdict[key][3]:
                    if item[0] in line:
                        value = line.split('=', 1)[1].strip()
                        print(value)
                        item[1].append(value)  # drop the stripped value into the list next to the search string
Am I on the wrong approach which will cause performance problems later on, and is there a cleaner alternative?
Below is an example of the file_object in the form of a Unicode block of text; there are both matching and differing packet IDs I need to account for.
14:27:42.0 ID_21 <<(ID_MATCH)
Type = 4
Fr = 2
search_string1 = -12
search_string2 = 1242
I would not try to re-invent the wheel were I in your position. Thus, I would use Pandas. It has something called DataFrames that would be a good fit for what you are trying to do. In addition, you can export those into Excel spreadsheets. Have a look at the 10-minute introduction.
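As a hedged illustration of that suggestion (it reuses file_object from the question; the field/column names, the float conversion, and the output file name are my assumptions, not part of the answer):

import pandas as pd

# Collect one flat row per "name = value" line instead of nesting lists,
# then let pandas compute the statistics and write the spreadsheet.
rows = []
for line in file_object.split('\n'):
    if '=' in line:
        name, value = line.split('=', 1)
        rows.append({'field': name.strip(), 'value': float(value)})

df = pd.DataFrame(rows)
stats = df.groupby('field')['value'].agg(['mean', 'max', 'min'])
stats.to_excel('packet_stats.xlsx')          # needs an Excel writer such as openpyxl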
I have a 20+GB dataset that is structured as follows:
1 3
1 2
2 3
1 4
2 1
3 4
4 2
(Note: the repetition is intentional and there is no inherent order in either column.)
I want to construct a file in the following format:
1: 2, 3, 4
2: 3, 1
3: 4
4: 2
Here is my problem; I have tried writing scripts in both Python and C++ to load in the file, create long strings, and write to a file line-by-line. It seems, however, that neither language is capable of handling the task at hand. Does anyone have any suggestions as to how to tackle this problem? Specifically, is there a particular method/program that is optimal for this? Any help or guided directions would be greatly appreciated.
You could try Hadoop for this. You can run a stand-alone MapReduce program: the mapper outputs the first column as the key and the second column as the value, so all outputs with the same key go to one reducer; the reducer then runs through its list of values and outputs the (key, valueString) line which is the final output you desire. You can start with a simple Hadoop tutorial and write the mapper and reducer as suggested. However, I've not tried to scale 20 GB of data on a stand-alone Hadoop system, so you may have to experiment. Hope this helps.
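A hedged sketch of what that mapper and reducer could look like with Hadoop Streaming (the answer does not prescribe Streaming; this is just one way to write them in Python, and the file names are mine):

# --- mapper.py: emit "key<TAB>value" for every input pair ---
import sys

def mapper():
    for line in sys.stdin:
        parts = line.split()
        if len(parts) == 2:
            print('%s\t%s' % (parts[0], parts[1]))

# --- reducer.py: Hadoop hands the reducer its lines already sorted by key ---
def reducer():
    current_key, values = None, []
    for line in sys.stdin:
        key, value = line.rstrip('\n').split('\t', 1)
        if current_key is not None and key != current_key:
            print('%s: %s' % (current_key, ', '.join(values)))
            values = []
        current_key = key
        values.append(value)
    if current_key is not None:
        print('%s: %s' % (current_key, ', '.join(values)))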
Have you tried using a std::vector of std::vector?
The outer vector represents each row. Each slot in the outer vector is a vector containing all the possible values for each row. This assumes that the row # can be used as an index into the vector.
Otherwise, you can try std::map<unsigned int, std::vector<unsigned int> >, where the key is the row number and the vector contains all values for the row.
A std::list of vectors would also work.
Does your program run out of memory?
Edit 1: Handling large data files
You can handle your issue by treating it like a merge sort:
1. Open a file for each row number.
2. Append the 2nd column values to the file.
3. After all data is read, close all files.
4. Open each file, read the values, and print them out, comma separated.
1. Open an output file for each key.
2. While iterating over the lines of the source file, append each value to the output file for its key.
3. Join the output files (a sketch of this approach follows below).
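A hedged Python sketch of this file-per-key approach; it assumes the keys are small integer-like tokens as in the sample, that one open handle per distinct key is acceptable, and that the file names used here are placeholders:

outputs = {}
with open('input.txt') as src:                   # placeholder input file name
    for line in src:
        parts = line.split()
        if len(parts) != 2:
            continue
        key, value = parts
        if key not in outputs:
            outputs[key] = open('key_%s.tmp' % key, 'w')
        outputs[key].write(value + '\n')

for handle in outputs.values():
    handle.close()

# Join the per-key files into the final output.
with open('result.txt', 'w') as dst:
    for key in sorted(outputs, key=int):
        with open('key_%s.tmp' % key) as f:
            values = f.read().split()
        dst.write('%s: %s\n' % (key, ', '.join(values)))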
An interesting thought, also found on Stack Overflow:
If you want to persist a large dictionary, you are basically looking at a database.
As recommended there, use Python's sqlite3 module to write to a table where the primary key is auto incremented, with a field called "key" (or "left") and a field called "value" (or "right").
Then SELECT from the table to find the MIN(key) and MAX(key), and with that information you can SELECT all rows that share the same "key" (or "left") value, in sorted order, and print that information to an outfile (if the database itself is not a good output format for you).
I have written this approach on the assumption that you call this problem "big data" because the number of keys does not fit well into memory (otherwise, a simple Python dictionary would be enough). However, IMHO this question is not correctly tagged as "big data": to require distributed computation on Hadoop or similar, your input data should be much larger than what you can hold on a single hard drive, or your computations should be much more costly than a simple hash table lookup and insertion.
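A hedged sqlite3 sketch of that idea; the table and column names, the file names, and the use of SELECT DISTINCT to walk the keys (instead of the MIN/MAX scan described above) are my own choices:

import sqlite3

conn = sqlite3.connect('pairs.db')
conn.execute('CREATE TABLE IF NOT EXISTS pairs '
             '(id INTEGER PRIMARY KEY AUTOINCREMENT, left INTEGER, right INTEGER)')

# Load the pairs; assumes each input line holds exactly two integer tokens.
with open('input.txt') as src, conn:
    conn.executemany('INSERT INTO pairs (left, right) VALUES (?, ?)',
                     (line.split() for line in src if line.strip()))

# Write one "key: v1, v2, ..." line per distinct left-hand value.
with open('result.txt', 'w') as dst:
    for (left,) in conn.execute('SELECT DISTINCT left FROM pairs ORDER BY left'):
        rights = [str(r) for (r,) in conn.execute(
            'SELECT right FROM pairs WHERE left = ? ORDER BY right', (left,))]
        dst.write('%s: %s\n' % (left, ', '.join(rights)))
conn.close()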
I am very new to Python and my apologies if this has already been answered. I can see a lot of previous answers to 'sort' questions, but my problem seems a little different from those questions and answers.
I have a list of keys, with each key contained in a tuple, that I am trying to sort. Each key is derived from a subset of the columns in a CSV file, but this subset is determined by the user at runtime and can't be hard coded as it will vary from execution to execution. I also have a datetime value that will always form part of the key as the last item in the tuple (so there will be at least one item to sort on - even if the user provides no additional items).
The tuples to be sorted look like:
(col0, col1, .... colN, datetime)
Where col0 to colN are based on the values found in columns in a CSV file, and the 'N' can change from run to run.
In each execution, the tuples in the list will always have the same number of items in each tuple. However, they need to be able to vary from run to run based on user input.
The sort looks like:
sorted(concurrencydict.keys(), key=itemgetter(0, 1, 2))
... when I do hard-code the sort based on the first three columns. The issue is that I don't know in advance of execution that 3 items will need to be sorted - it may be 1, 2, 3 or more.
I hope this description makes sense.
I haven't been able to think of how I can get itemgetter to accept a variable number of values.
Does anyone know whether there is an elegant way of performing a sort based on a variable number of items in python where the number of sort items is determined at run time (and not based on fixed column numbers or attribute names)?
I guess I'll turn my comment into an answer.
You can pass a variable number of arguments to a function by unpacking an iterable with the * operator in the call. In your specific case, you can put the user-supplied selection of column numbers to sort by into a sort_columns list or tuple, then call:
sorted_keys = sorted(concurrencydict.keys(), key=itemgetter(*sort_columns))
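A small self-contained illustration of the idea; the tuples and the chosen column indices here are made up for the example:

from operator import itemgetter

keys = [
    ('NY', 'open',  '2013-01-02'),
    ('CA', 'close', '2013-01-01'),
    ('CA', 'open',  '2013-01-01'),
]

sort_columns = (0, 2)                     # e.g. the user picked col0 and the datetime
print(sorted(keys, key=itemgetter(*sort_columns)))
# [('CA', 'close', '2013-01-01'), ('CA', 'open', '2013-01-01'), ('NY', 'open', '2013-01-02')]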