Count lines containing word
I have a file with multiple lines. I want to know, for each word that appears in the total file, how many lines contain that word, for example:
0 hello world the man is world
1 this is the world
2 a different man is the possible one
The result I'm expecting is:
0:1
1:1
2:1
a:1
different:1
hello:1
is:3
man:2
one:1
possible:1
the:3
this:1
world:2
Note that the count for "world" is 2, not 3, since the word appears on 2 lines. Because of this, translating blanks to newline chars wouldn't be the exact solution.
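For instance, the naive translation counts every occurrence rather than every line (shown on the sample above; file is a hypothetical name for the input):

$ tr ' ' '\n' < file | sort | uniq -c | grep -w world
      3 world

Here world comes out as 3 because it appears twice on line 0.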
text-processing
asked Jan 4 at 15:16 by Netzsooc; edited Jan 4 at 18:33 by Jeff Schaller
What have you tried so far?
– Romeo Ninov
Jan 4 at 15:28
This seems highly relevant: unix.stackexchange.com/a/332890/224077
– Panki
Jan 4 at 15:41
8 Answers
Another Perl variant, using List::Util
$ perl -MList::Util=uniq -alne '
map { $h{$_}++ } uniq @F }{ for $k (sort keys %h) {print "$k: $h{$k}"}
' file
0: 1
1: 1
2: 1
a: 1
different: 1
hello: 1
is: 3
man: 2
one: 1
possible: 1
the: 3
this: 1
world: 2
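The }{ at the end of the one-liner is the "Eskimo kiss" trick: -n wraps the code in a while (<>) { ... } loop, and }{ closes that loop early so the remaining code runs once after all input has been read. A minimal sketch of the same program written with an explicit END block instead:

$ perl -MList::Util=uniq -alne '$h{$_}++ for uniq @F;
      END { print "$_: $h{$_}" for sort keys %h }' file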
answered Jan 4 at 16:11 by steeldriver
Straightforward-ish in bash (associative arrays require bash 4 or later):
declare -A wordcount
while read -ra words; do
# unique words on this line
declare -A uniq
for word in "${words[@]}"; do
uniq[$word]=1
done
# accumulate the words
for word in "${!uniq[@]}"; do
((wordcount[$word]++))
done
unset uniq
done < file
Looking at the data:
$ declare -p wordcount
declare -A wordcount='([possible]="1" [one]="1" [different]="1" [this]="1" [a]="1" [hello]="1" [world]="2" [man]="2" [0]="1" [1]="1" [2]="1" [is]="3" [the]="3" )'
and formatting as you want:
$ printf "%sn" "${!wordcount[@]}" | sort | while read key; do echo "$key:${wordcount[$key]}"; done
0:1
1:1
2:1
a:1
different:1
hello:1
is:3
man:2
one:1
possible:1
the:3
this:1
world:2
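A slightly tighter way to print the sorted result, assuming the keys contain no whitespace (a sketch):

for key in $(printf '%s\n' "${!wordcount[@]}" | sort); do
    echo "$key:${wordcount[$key]}"
done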
answered Jan 4 at 16:42 by glenn jackman
It's a pretty straight-forward perl script:
#!/usr/bin/perl -w
use strict;
my %words = ();
while (<>) {
chomp;
my %linewords = ();
map { $linewords{$_}=1 } split / /;
foreach my $word (keys %linewords) {
$words{$word}++;
}
}
foreach my $word (sort keys %words) {
print "$word:$words{$word}n";
}
The basic idea is to loop over the input; for each line, split it into words and store them as keys of a hash (associative array), which removes any duplicates on that line; then loop over those keys and add one to an overall counter for each word. At the end, report the words and their counts.
answered Jan 4 at 15:59 by Jeff Schaller
A slight problem with this, in my opinion, is that it does not respect the usual definition of a word, since it splits on a single space character. If two spaces were found somewhere, the empty string in between would be considered a word as well, if I'm not mistaken. Let alone if words were separated by other punctuation characters. Of course, the question does not specify whether "word" means the programmer's concept of a "word" or a word of a natural language.
– Larry
Jan 4 at 16:38
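Larry's point can be checked directly: Perl's split / / keeps the empty fields produced by runs of spaces, while the special pattern split ' ' splits on whitespace runs and discards leading whitespace. A minimal demonstration:

$ perl -le 'print join ",", split / /, "a  b"'
a,,b
$ perl -le 'print join ",", split " ", "a  b"'
a,b

So replacing split / / with split ' ' in the script above would sidestep the empty-word issue.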
A solution that calls several programs from a shell:
fmt -1 words.txt | sort -u | xargs -Ipattern sh -c 'echo "pattern:$(grep -cw pattern words.txt)"'
A little explanation:
fmt -1 words.txt prints all the words, one per line, and sort -u sorts that output and keeps only the unique words.
To count the lines containing a word, one can use grep (a tool meant to search files for patterns). The -c option makes grep report the number of matching lines rather than the number of matches, and -w restricts matches to whole words, so grep -cw pattern words.txt gives exactly the per-line count wanted here.
xargs lets us do this for every word output by sort. The -Ipattern flag means it executes the following command once per line read from standard input, replacing each occurrence of pattern with that word.
The indirection through sh is needed because xargs only knows how to execute a single program, given its name, passing everything else as arguments; it does not handle command substitution. The $(...) in the snippet is command substitution: it splices the output of grep into the echo so the result is formatted correctly. Since we need command substitution, we run sh -c, which executes whatever it receives as an argument in its own shell.
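Checked against the sample data (words.txt is the file name used in the answer):

$ grep -cw world words.txt
2

Because -c counts matching lines, world comes out as 2 here even though it occurs 3 times in total.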
answered Jan 4 at 17:33 by Larry; edited Jan 4 at 21:13 by vikarjramun
An optimisation to this approach: fmt -1 words.txt | sort | uniq -c | awk '{ print $2 ":" $1 }'
– matja
Jan 5 at 0:14
@matja is sort | uniq -c more efficient than sort -u?
– vikarjramun
Jan 5 at 3:31
@vikarjramun no, but uniq -c gives you the counts of each word in one pass, so you don't have to use xargs to do multiple passes over the input file for each word.
– matja
Jan 5 at 10:11
@matja: I actually made the answer you provided before the current one. However, it does not do what OP asked for. I misread the question at first entirely as well, and was corrected by glenn jackman. What you are suggesting would count every occurrence of each word. What OP asked for is to count the number of lines each word occurs in at least once.
– Larry
Jan 5 at 10:17
Another simple alternative would be to use Python (3.6 or later, for f-strings). This solution has the same problem as the one mentioned by @Larry in his comment.
from collections import Counter

with open("words.txt") as f:
    c = Counter(word for line in [line.strip().split() for line in f] for word in set(line))

for word, occurrence in sorted(c.items()):
    print(f'{word}:{occurrence}')
    # for Python 2.7.x compatibility you can replace the above line with
    # the following one:
    # print('{}:{}'.format(word, occurrence))
A more explicit version of the above:
from collections import Counter

FILENAME = "words.txt"

def find_unique_words():
    with open(FILENAME) as f:
        lines = [line.strip().split() for line in f]
        unique_words = Counter(word for line in lines for word in set(line))
        return sorted(unique_words.items())

def print_unique_words():
    unique_words = find_unique_words()
    for word, occurrence in unique_words:
        print(f'{word}:{occurrence}')

def main():
    print_unique_words()

if __name__ == '__main__':
    main()
Output:
0:1
1:1
2:1
a:1
different:1
hello:1
is:3
man:2
one:1
possible:1
the:3
this:1
world:2
The above also assumes that words.txt is in the same directory as script.py. Note that this is not much different from other solutions provided here, but perhaps somebody will find it useful.
answered Jan 4 at 20:57 by яүυк; edited Jan 5 at 12:37 by David Foerster
Trying to do it with awk:
count.awk:
#!/usr/bin/awk -f
# count lines containing each word
{
for (i = 1 ; i <= NF ; i++) {
word_in_a_line[$i] ++
if (word_in_a_line[$i] == 1) {
word_line_count[$i] ++
}
}
delete word_in_a_line
}
END {
for (word in word_line_count){
printf "%s:%dn",word,word_line_count[word]
}
}
Run it by:
$ awk -f count.awk ./test.data | sort
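The same logic also fits in a one-liner; this sketch relies on whole-array delete, a widespread awk extension that the script above uses as well:

$ awk '{ delete seen; for (i = 1; i <= NF; i++) if (!seen[$i]++) count[$i]++ }
       END { for (w in count) printf "%s:%d\n", w, count[w] }' ./test.data | sort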
answered Jan 6 at 1:26 by Charles
A shell pipeline answer (not strictly pure bash, since it leans on tr, sort and uniq):
echo "0 hello world the man is world
1 this is the world
2 a different man is the possible one" | while IFS=$'n' read -r line; do echo $line | tr ' ' 'n' | sort -u; done | sort | uniq -c
1 0
1 1
1 2
1 a
1 different
1 hello
3 is
2 man
1 one
1 possible
3 the
1 this
2 world
I looped over each line, extracted its unique words with sort -u, and passed the result to uniq -c.
edit: I did not see glenn's answer earlier; I found it strange not to see a bash answer.
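Note the output is in uniq -c's count-first format. If the word:count format from the question is needed, a small awk post-process (the same transform matja suggests above) does it:

... | sort | uniq -c | awk '{ print $2 ":" $1 }'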
answered Jan 6 at 4:48 by user1462442; edited Jan 6 at 4:57
Simple, though doesn't care if it reads the file many times:
sed 's/ /\n/g' file.txt | sort | uniq | while read -r word; do
    printf "%s:%d\n" "$word" "$(grep -Fw "$word" file.txt | wc -l)"
done
EDIT: Despite converting spaces to newlines, this does count lines that have an occurrence of each word and not the occurrences of the words themselves. It gives the result:
0:1
1:1
2:1
a:1
different:1
hello:1
is:3
man:2
one:1
possible:1
the:3
this:1
world:2
which is character-by-character identical to OP's example result.
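Since grep -c already reports the number of matching lines, the grep ... | wc -l can be collapsed into a single call; a minimal sketch:

sed 's/ /\n/g' file.txt | sort -u | while read -r word; do
    printf '%s:%d\n' "$word" "$(grep -cFw "$word" file.txt)"
done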
answered Jan 5 at 2:03 by JoL; edited Jan 7 at 16:54
Read the question again. It literally says "translating blanks to newline chars wouldn't be the exact solution".
– Sparhawk
Jan 5 at 9:59
@Sparhawk Read the answer again. This does give the answer he gave as an example, including the result of 2 instead of 3 for world. He meant that doing something like sed 's/ /\n/g' | sort | uniq -c would not work, because it would give the answer 3 for world, but that's not what this answer does. It correctly counts the lines where the words occur, not the occurrences themselves, just like OP wanted.
– JoL
Jan 6 at 7:03
Ah right, apologies! I would recommend putting in an explanation of your code, which is both helpful to the questioner and clarifies what it does. Also, as a minor point, you probably want read -r here.
– Sparhawk
Jan 6 at 9:38