Python For loop with data (csv) - python

I have this data:
http://prntscr.com/gojey0
It continues like this for many more rows.
How do I find the top 20 most common platforms using python code?
I'm really lost. I thought of maybe going through the list in a for loop and counting each one? That seems wrong, though...

Use pandas: http://pandas.pydata.org/
Something like:
import pandas as pd

df = pd.read_csv("your_csv_file.csv")
# Count how often each platform appears and keep the 20 most common.
top_platforms = df["Platform"].value_counts().nlargest(20)

A dictionary would be a good choice for collecting this information:
Initialize an empty dict.
For each row in the csv file:
Get the platform column.
If that platform is not already in the dict, create it with a count of one.
Otherwise if it is already in the dict, increment its count by one.
When you're done, sort the dict by the count value and print the top 20 entries (a sketch of these steps follows below).
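A minimal sketch of those steps with the csv module, assuming the file has a header row and the column is named "Platform" (the name is taken from the other answers here, not from the screenshot):
import csv

counts = {}  # platform -> number of times it appears
with open("your_csv_file.csv", newline="") as f:
    for row in csv.DictReader(f):
        platform = row["Platform"]                      # assumed column name
        counts[platform] = counts.get(platform, 0) + 1

# Sort by count, largest first, and print the top 20 entries.
for platform, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:20]:
    print(platform, count)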

I would use pandas to read in csv files:
import pandas as pd
from collections import Counter

df = pd.read_csv('DATA.csv')  # read the csv file into a dataframe *df*

# Create a Counter from a dictionary of per-platform row counts,
# using the pandas groupby and count methods.
d = Counter(dict(df.groupby('Platform')['Platform'].count()))
d will be a Counter object "containing" a dictionary of the form {<platform>: <number of occurrences in the dataset>}.
You can get the top k most common platforms as follows:
k = 20
d.most_common(k)
>>> [('<platform1>', count1),
     ('<platform2>', count2),
     ('<platform3>', count3),
     ('<platform4>', count4),
     ...]
Hope that helps. In future, it would help to see the head (first few lines) of your data, what code you have tried so far... or even what data wrangling tool you're using!

Related

How to replace empty values with reference to another dataframe?

I have 2 data frames. One is a reference table with columns code and name. The other one is a list of dictionaries. The second data frame has code filled in, but some names are empty strings. I am thinking of performing 2 for loops to get to the dictionary, but I am new to this, so I'm unsure how to get the value from the reference table.
Started with something like this:
for i in sample:
    for j in i:
        if j['name'] == '':
            (j['code'])
I am unsure how to proceed with the code. I think there is a very simple way with the .map() function. Can someone help?
Reference table: (image omitted)
Table that needs editing: (image omitted)
It seems to me that in this particular case you're using Pandas only to work with Python data structures. If that's the case, it would make sense to ditch Pandas altogether and just use Python data structures - usually, it results in more idiomatic and readable code that often performs better than Pandas with dtype=object.
In any case, here's the code:
import pandas as pd

sample_name = pd.DataFrame(dict(code=[8, 1, 6],
                                name=['Human development',
                                      'Economic managemen',
                                      'Social protection and risk management']))
# We just need a Series.
sample_name = sample_name.set_index('code')['name']

sample = pd.Series([[dict(code=8, name='')],
                    [dict(code=1, name='')],
                    [dict(code=6, name='')]])

def fix_dict(d):
    if not d['name']:
        d['name'] = sample_name.at[d['code']]
    return d

def fix_dicts(dicts):
    return [fix_dict(d) for d in dicts]

result = sample.map(fix_dicts)
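Following the suggestion above to ditch Pandas, here is a plain-Python sketch of the same fix, with the lookup values copied from the sample data:
# code -> name lookup built from the reference table
name_by_code = {8: 'Human development',
                1: 'Economic managemen',
                6: 'Social protection and risk management'}

# The list-of-lists-of-dicts structure from the question.
sample = [[dict(code=8, name='')],
          [dict(code=1, name='')],
          [dict(code=6, name='')]]

for dicts in sample:
    for d in dicts:
        if not d['name']:                       # empty name -> fill it from the lookup
            d['name'] = name_by_code[d['code']]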

Pandas - KeyError - Dropping rows by index in a nested loop

I have a Pandas dataframe named df. I am attempting to use a nested for loop to iterate through each tuple of the dataframe and, at each iteration, compare the tuple with all other tuples in the frame. During the comparison step, I am using Python's difflib.SequenceMatcher().ratio() and dropping tuples that have a high similarity (ratio > 0.8).
Problem:
Unfortunately, I am getting a KeyError after the first outer-loop iteration.
I suspect that, by dropping tuples, I am invalidating the outer loop's indexer, or that I am invalidating the inner loop's indexer by attempting to access an element that doesn't exist (because it was dropped).
Here is the code:
import json
import pandas as pd
import pyreadline
import pprint
from difflib import SequenceMatcher

def similar(a, b):
    # Helper implied by the question: SequenceMatcher similarity ratio.
    return SequenceMatcher(None, a, b).ratio()

# Note, this file, 'tweetsR.json', was originally csv, but has been translated to json.
with open("twitter data/tweetsR.json", "r") as read_file:
    data = json.load(read_file)  # Load the source data set, esport tweets.

df = pd.DataFrame(data)  # Load data into a pandas (pd) data frame for pandas utilities.
df = df.drop_duplicates(['text'], keep='first')  # Drop tweets with identical text content.
                                                 # Note, these tweets are likely reposts/retweets, etc.
df = df.reset_index(drop=True)  # Adjust the index to reflect dropping of duplicates.

def duplicates(df):
    for ind in df.index:
        a = df['text'][ind]
        for indd in df.index:
            if indd != 26747:  # Trying to prevent an overstep KeyError here
                b = df['text'][indd+1]
                if similar(a, b) >= 0.80:
                    df.drop((indd+1), inplace=True)
        print(str(ind) + " Completed")  # Debugging statement, tells us which iterations have completed

duplicates(df)
Error Output: (KeyError traceback omitted)
Can anyone help me understand this and/or fix it?
One solution, which was mentioned by @KazuyaHatta, is itertools.combinations(). Although, the way I've used it (there may be another way), it's O(n^2). So, in this case, with ~27,000 tuples, it's nearly 357,714,378 combinations to iterate over (too long).
Here is the code:
import itertools

# Create a set of the dropped tuples and run this code on bizon overnight.
def duplicates(df):
    # Find out how to improve the speed of this
    excludes = set()
    combos = itertools.combinations(df.index, 2)
    for combo in combos:
        if f'{combo[0]}, {combo[1]}' not in excludes:
            if similar(df['text'][combo[0]], df['text'][combo[1]]) > 0.8:
                excludes.add(f'{combo[0]}, {combo[1]}')
                excludes.add(f'{combo[1]}, {combo[0]}')
                print("Dropped: " + str(combo))
                print(len(excludes))

duplicates(df)
My next step, which @KazuyaHatta described, is to attempt the dropping-by-mask method.
Note: I unfortunately won't be able to post a sample of the dataset.
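A minimal sketch of that mask-based idea, assuming similar() is the SequenceMatcher helper from the question code above: build a boolean keep mask while comparing rows by position, and filter the dataframe once at the end, so nothing is removed while the loops are still iterating over the index.
def drop_near_duplicates(df, threshold=0.80):
    texts = df['text'].tolist()
    keep = [True] * len(texts)              # one flag per row, by position
    for i in range(len(texts)):
        if not keep[i]:
            continue                        # already marked as a duplicate of an earlier row
        for j in range(i + 1, len(texts)):
            if keep[j] and similar(texts[i], texts[j]) >= threshold:
                keep[j] = False             # mark the later near-duplicate for removal
    return df[pd.Series(keep, index=df.index)]

df = drop_near_duplicates(df)
This is still O(n^2) comparisons, but because the dataframe is only filtered once at the end, no index is invalidated mid-loop and the KeyError goes away.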

How to efficiently match values from 2 series and add them to a dataframe

I have a csv file "qwi_ak_se_fa_gc_ns_op_u.csv" which contains a lot of observations of 80 variables. One of them is geography which is the county. Every county belongs to something called a Commuting Zone (CZ). Using a matching table given in "czmatcher.csv" I can assign a CZ to every county given in geography.
The code below shows my approach. It simply goes through every row and finds its CZ by going through the whole "czmatcher.csv" table and finding the right row. Then I proceed to just copy the values using .loc. The problem is that this took over 10 hours to run on a 0.5 GB file (2.5 million rows), which isn't that much, and my intuition says this should be faster.
This picture illustrates what the csv files look like. The idea would be to construct the "Wanted result (CZ)" column, name it CZ, and add it to the dataframe.
File example: (image omitted)
import pandas as pd

data = pd.read_csv("qwi_ak_se_fa_gc_ns_op_u.csv")
czm = pd.read_csv("czmatcher.csv")
sLength = len(data['geography'])
data['CZ'] = 0

# this is just to fill the first value
for j in range(0, len(czm)):
    if data.loc[0, 'geography'] == czm.loc[0, 'FIPS']:
        data.loc[0, 'CZ'] = czm.loc[0, 'CZID']

# now fill the rest
for i in range(1, sLength):
    if data.loc[i, 'geography'] == data.loc[i-1, 'geography']:
        data.loc[i, 'CZ'] = data.loc[i-1, 'CZ']
    else:
        for j in range(0, len(czm)):
            if data.loc[i, 'geography'] == czm.loc[j, 'FIPS']:
                data.loc[i, 'CZ'] = czm.loc[j, 'CZID']
Is there a faster way of doing this?
The best way to do this is a left merge of your dataframes:
data = pd.read_csv("qwi_ak_se_fa_gc_ns_op_u.csv")
czm = pd.read_csv("czmatcher.csv")
The join key is not spelled the same in the two dataframes (geography in data, FIPS in czm), so pass left_on/right_on:
data_final = data.merge(czm, how='left', left_on='geography', right_on='FIPS')
Alternatively, rename one of the columns so you can merge on a single key:
czm.rename(columns={'FIPS': 'geography'}, inplace=True)
data_final = data.merge(czm, how='left', on='geography')
Read the docs for further information: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
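If only the single CZ column is needed, a map on a FIPS-indexed Series is another option. This is a sketch that assumes the column names from the question (geography, FIPS, CZID) and that FIPS values are unique in czmatcher.csv:
# Build a FIPS -> CZID lookup Series, then map it onto the geography column.
cz_lookup = czm.set_index('FIPS')['CZID']
data['CZ'] = data['geography'].map(cz_lookup)
Both the merge and the map are vectorized, so they avoid the 2.5 million row-by-row .loc writes that dominate the 10-hour runtime.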
In order to make it faster without reworking your whole solution, I would recommend using Dask DataFrames. To put it simply, Dask reads your csv in chunks and processes each of them in parallel. After reading the csv you can use the .compute method to get a pandas df instead of a Dask df.
This will look like this:
import pandas as pd
import dask.dataframe as dd  # IMPORT DASK DATAFRAMES

# YOU NEED TO USE dd.read_csv instead of pd.read_csv
data = dd.read_csv("qwi_ak_se_fa_gc_ns_op_u.csv")
data = data.compute()
czm = dd.read_csv("czmatcher.csv")
czm = czm.compute()

sLength = len(data['geography'])
data['CZ'] = 0

# this is just to fill the first value
for j in range(0, len(czm)):
    if data.loc[0, 'geography'] == czm.loc[0, 'FIPS']:
        data.loc[0, 'CZ'] = czm.loc[0, 'CZID']

# now fill the rest
for i in range(1, sLength):
    if data.loc[i, 'geography'] == data.loc[i-1, 'geography']:
        data.loc[i, 'CZ'] = data.loc[i-1, 'CZ']
    else:
        for j in range(0, len(czm)):
            if data.loc[i, 'geography'] == czm.loc[j, 'FIPS']:
                data.loc[i, 'CZ'] = czm.loc[j, 'CZID']

What is a proper idiom in pandas for creating dataframes from the output of an apply function on a df?

Edit --- I've made some progress, and discovered the drop_duplicates method in pandas, which saves some custom duplicate removal functions I created.
This changes the question in a couple of ways, b/c it changes my initial requirements.
One of the operations I need to conduct is grabbing the latest feed entries --- the feed urls exist in a column in a data frame. Once I've done the apply I get feed objects back:
import pandas as pd
import feedparser
import datetime
df_check_feeds = pd.DataFrame({'account_name':['NYTimes', 'WashPo'],'feed_url':['http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml', 'http://feeds.washingtonpost.com/rss/homepage'], 'last_update':['2015-12-28 23:50:40', '2015-12-28 23:50:40']})
df_check_feeds["feeds_results"] = pd.DataFrame(df_check_feeds.feed_url.apply(lambda feed_url: feedparser.parse(feed_url)))
df_check_feeds["entries"] = df_check_feeds.feeds_results.apply(lambda x: x.entries)
So, now I'm stuck with the feed entries in the "entries" column. I'd like to create two new data frames in one apply method and concatenate the two frames immediately.
I've expressed the equivalent in a for loop:
frames_list = []
for index in df_check_feeds.index:
    df_temp = pd.DataFrame(df_check_feeds.entries[index])
    df_temp['account_name'] = df_check_feeds.loc[index, 'account_name']
    # some error checking on the info here
    frames_list.append(df_temp)

df_total_results = pd.concat(frames_list)
df_total_results
I realize I could do this in a for loop (and indeed have written that), but I feel there is some better, more succinct pandas idiomatic way of writing this statement.
A more compact way could be:
df_total_results = df_check_feeds.groupby('account_name').apply(lambda x: pd.DataFrame(x['entries'].iloc[0]))
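Another possibility, sketched under the assumption that every value in entries is a list of feed-entry dicts (as feedparser returns), is to mirror the loop above in a single concat expression:
df_total_results = pd.concat(
    # one small frame per feed, tagged with its account name
    pd.DataFrame(entries).assign(account_name=name)
    for name, entries in zip(df_check_feeds['account_name'], df_check_feeds['entries'])
)
assign() attaches the account name to each per-feed frame before the concat, which is what the explicit loop was doing with df_temp['account_name'].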

Grouping data by value in first column

I'm trying to group data from a 2-column object based on the value of the first column. I need this data in a list so I can sort it afterwards. I am fetching interface data with SNMP on a large number of machines. In the example I have 2 interfaces. I need the data grouped by interface, preferably in a list.
The data I get is in the object item:
for i in item:
    print i.oid, i.val
ifDescr lo
ifDescr eth0
ifAdminStatus 1
ifAdminStatus 1
ifOperStatus 1
ifOperStatus 0
I would like to get this data sorted in a list by the value in the first column, so it looks like this:
list = [['lo', 1, 1], ['eth0', 1, 0]]
The solution I have is dirty and long and I'm embarrassed to post it here, so any help is appreciated.
Here is my solution so you get a better picture of what I'm talking about. What I did is put each interface's data in a separate list based on item.oid, and then iterated through the cpu list and compared it to memory and name based on item.iid. In the end I have all the data in the cpu list, where each interface is an element of the list. This solution works, but is too slow for my needs.
from operator import itemgetter  # needed for the sort at the end

cpu = []
memory = []
name = []
for item in process:
    if item.oid == 'ifDescr':
        cpu.append([item.iid, item.val])  # ifDescr values are names like 'lo', so keep them as strings
    if item.oid == 'ifAdminStatus':
        memory.append([item.iid, int(item.val)])
    if item.oid == 'ifOperStatus':
        name.append([item.iid, item.val])

for c in cpu:
    for m in memory:
        if m[0] == c[0]:
            c.append(m[1])
    for n in name:
        if n[0] == c[0]:
            c.append(n[1])

cpu = sorted(cpu, key=itemgetter(1), reverse=True)  # sorting is easy
Is there a pythonic, shorter and faster way of doing this? The limiting factor is that I get the data in a 2-column object with key=data values.
Not sure I follow your sorting, as I don't see any order, but to group you can use a dict keyed by oid, with a defaultdict handling the repeating keys:
data = """ifDescr lo
ifDescr eth0
ifAdminStatus 1
ifAdminStatus 1
ifOperStatus 1
ifOperStatus 0"""
from collections import defaultdict

d = defaultdict(list)
for line in data.splitlines():
    a, b = line.split()
    d[a].append(b)

print(list(d.items()))
[('ifOperStatus', ['1', '0']), ('ifAdminStatus', ['1', '1']), ('ifDescr', ['lo', 'eth0'])]
Using your code, just use the attributes:
for i in item:
    d[i.oid].append(i.val)
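To get from that grouped dict to the per-interface list shown in the question, one extra step is to zip the three oid lists together; this assumes the values arrive in the same interface order for each oid, as they do in the sample:
# Combine the grouped lists into one row per interface.
rows = [[descr, int(admin), int(oper)]
        for descr, admin, oper in zip(d['ifDescr'], d['ifAdminStatus'], d['ifOperStatus'])]
# rows == [['lo', 1, 1], ['eth0', 1, 0]]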
Pandas is a great way to work with data. Here is some quick example code. Check out the official website for more info.
# Python script using Pandas and Numpy
from pandas import DataFrame
from numpy import random

# Data with the dictionary keys defining the columns
data_dictionary = {'a': random.random(5),
                   'b': random.random(5)}

# Make a data frame
data_frame = DataFrame(data_dictionary)
print(data_frame)

# Return a new data frame sorted by the first column
data_frame_sorted = data_frame.sort_values(by='a')
print(data_frame_sorted)
This should run if you have numpy and pandas installed. If you don't have a clue about installing pandas, go get the Anaconda Python distribution.
